
AN INTRODUCTION TO INFORMATION THEORY
Symbols, Signals and Noise

JOHN R. PIERCE
Professor of Engineering
California Institute of Technology

Second, Revised Edition

Dover Publications, Inc.
New York

TO CLAUDE AND BETTY SHANNON

Copyright © 1961, 1980 by John R. Pierce. All rights reserved under Pan American and International Copyright Conventions.

This Dover edition, first published in 1980, is an unabridged and revised version of the work originally published in 1961 by Harper & Brothers under the title Symbols, Signals and Noise: The Nature and Process of Communication.

International Standard Book Number: 0-486-24061-4
Catalog Card Number: 80-66678

Manufactured in the United States of America
Dover Publications, Inc., 31 East 2nd Street, Mineola, N.Y. 11501

Contents

Preface to the Dover Edition  vii
I. The World and Theories  1
II. The Origins of Information Theory  19
III. A Mathematical Model  45
IV. Encoding and Binary Digits  64
V. Entropy  78
VI. Language and Meaning  107
VII. Efficient Encoding  125
VIII. The Noisy Channel  145
IX. Many Dimensions  166
X. Information Theory and Physics  184
XI. Cybernetics  208
XII. Information Theory and Psychology  229
XIII. Information Theory and Art  250
XIV. Back to Communication Theory  268
Appendix: On Mathematical Notation  278
Glossary  287
Index  295

Preface to the Dover Edition

The republication of this book gave me an opportunity to correct and bring up to date Symbols, Signals and Noise,¹ which I wrote almost twenty years ago. Because the book deals largely with Shannon's work, which remains eternally valid, I found that there were not many changes to be made. In a few places I altered tense in referring to men who have died. I did not try to replace cycles per second (cps) by the more modern term, hertz, nor did I change everywhere communication theory (Shannon's term) to information theory, the term I would use today.

Some things I did alter, rewriting a few paragraphs and about twenty pages without changing the pagination. In Chapter X, Information Theory and Physics, I replaced a background radiation temperature of space of "2° to 4°K" (Heaven knows where I got that) by the correct value of 3.5°K, as determined by Penzias and Wilson. To the fact that in the absence of noise we can in principle transmit an unlimited number of bits per quantum, I added new material on quantum effects in communication.² I also replaced an obsolete thought-up example of space communication by a brief analysis of the transmission of picture signals from the Voyager spacecraft near Jupiter, and by an exposition of new possibilities.

¹ Harper Modern Science Series, Harper and Brothers, New York, 1961.
² See Introduction to Communication Science and Systems, John R. Pierce and Edward C. Posner, Plenum Publishing Corporation, New York, 1980.

In Chapter VII, Efficient Encoding, I rewrote a few pages concerning efficient source encoding of TV and changed a few sentences about pulse code modulation and about vocoders. I also changed the material on error correcting codes. In Chapter XI, Cybernetics, I rewrote four pages on computers and programming, which have advanced incredibly during the last twenty years. Finally, I made a few changes in the last short Chapter XIV, Back to Communication Theory.

Beyond these revisions, I call to the reader's attention a series of papers on the history of information theory that were published in 1973 in the IEEE Transactions on Information Theory³ and two up-to-date books as telling in more detail the present state of information theory and the mathematical aspects of communication.²,⁴,⁵

Several chapters in the original book deal with areas relevant only through application or attempted application of information theory. I think that Chapter XII, Information Theory and Psychology, gives a fair idea of the sort of applications attempted in that area. Today psychologists are less concerned with information theory than with cognitive science, a heady association of truly startling progress in the understanding of the nervous system, with ideas drawn from anthropology, linguistics and a belief that some powerful and simple mathematical order must underlie human function. Cognitive science of today reminds me of cybernetics of twenty years ago.

As to Information Theory and Art, today the computer has replaced information theory in casual discussions. But the ideas explored in Chapter XIII have been pursued further. I will mention some attractive poems produced by Marie Borroff⁶,⁷ and, especially, a grammar of Swedish folksongs by means of which Johan Sundberg produced a number of authentic sounding tunes.⁸

³ IEEE Transactions on Information Theory, Vol. IT-19, pp. 3-8, 145-148, 257-262, 381-389 (1973).
⁴ The Theory of Information and Coding, Robert J. McEliece, Addison-Wesley, Reading, MA, 1977.
⁵ Principles of Digital Communication and Coding, Andrew J. Viterbi and Jim K. Omura, McGraw-Hill, New York, 1979.
⁶ "Computer as Poet," Marie Borroff, Yale Alumni Magazine, Jan. 1971.
⁷ Computer Poems, gathered by Richard Bailey, Potagannissing Press, 1973.

This brings us back to language and Chapter VI, Language and Meaning. The problems raised in that chapter have not been resolved during the last twenty years. We do not have a complete grammar of any natural language. Indeed, formal grammar has proved most powerful in the area of computer languages. It is my reading that attention in linguistics has shifted somewhat to the phonological aspects of spoken language, to understanding what its building blocks are and how they interact, matters of great interest in the computer generation of speech from text. Chomsky and Halle have written a large book on stress,⁹ and Liberman and Prince a smaller and very powerful account.¹⁰

So much for changes from the original Symbols, Signals and Noise. Beyond this, I can only reiterate some of the things I said in the preface to that book.

When James R. Newman suggested to me that I write a book about communication I was delighted. All my technical work has been inspired by one aspect or another of communication. Of course I would like to tell others what seems to me to be interesting and challenging in this important field.

It would have been difficult to do this and to give any sense of unity to the account before 1948, when Claude E. Shannon published "A Mathematical Theory of Communication."¹¹ Shannon's communication theory, which is also called information theory, has brought into a reasonable relation the many problems that have been troubling communication engineers for years. It has created a broad but clearly defined and limited field where before there were many special problems and ideas whose interrelations were not well understood. No one can accuse me of being a Shannon worshiper and get away unrewarded.

Thus, I felt that my account of communication must be an account of information theory as Shannon formulated it. The account would have to be broader than Shannon's in that it would discuss the relation, or lack of relation, of information theory to the many fields to which people have applied it. The account would have to be broader than Shannon's in that it would have to be less mathematical. Here came the rub.

⁸ "Generative Theories in Language and Musical Descriptions," Johan Sundberg and Björn Lindblom, Cognition, Vol. 4, pp. 99-122, 1976.
⁹ The Sound Pattern of English, N. Chomsky and M. Halle, Harper and Row, 1968.
¹⁰ "On Stress and Linguistic Rhythm," Mark Liberman and Alan Prince, Linguistic Inquiry, Vol. 8, No. 2, pp. 249-336, Spring, 1977.
¹¹ The papers, originally published in the Bell System Technical Journal, are reprinted in The Mathematical Theory of Communication, Shannon and Weaver, University of Illinois Press, first printing 1949. Shannon presented a somewhat different approach (used in Chapter IX of this book) in "Communication in the Presence of Noise," Proceedings of the Institute of Radio Engineers, Vol. 37, pp. 10-21, 1949.
My account could be less mathematical than Shannon's, but it could not be nonmathematical. Information theory is a mathematical theory. It starts from certain premises that define the aspects of communication with which it will deal, and it proceeds from these premises to various logical conclusions. The glory of information theory lies in certain mathematical theorems which are both surprising and important. To talk about information theory without communicating its real mathematical content would be like endlessly telling a man about a wonderful composer yet never letting him hear an example of the composer's music.

How was I to proceed? It seemed to me that I had to make the book self-contained, so that anything in it could be understood without referring to other books or without calling for the particular content of early mathematical training, such as high school algebra. Did this mean that I had to avoid mathematical notation? Not necessarily, but any mathematical notation would have to be explained in the most elementary terms. I have done this both in the text and in an appendix; by going back and forth between the two, the mathematically untutored reader should be able to resolve any difficulties.

But just how difficult should the most difficult mathematical arguments be? Although it meant sliding over some very important points, I resolved to keep things easy compared with, say, the more difficult parts of Newman's The World of Mathematics. When the going is very difficult, I have merely indicated the general nature of the sort of mathematics used rather than trying to describe its content clearly.

Nonetheless, this book has sections which will be hard for the nonmathematical reader. I advise him merely to skim through these, gathering what he can. When he has gone through the book in this manner, he will see why the difficult sections are there. Then he can turn back and restudy them if he wishes. But, had I not put these difficult sections in, and had the reader wanted the sort of understanding that takes real thought, he would have been stuck. As far as I know, other available literature on information theory is either too simple or too difficult to help the diligent but inexpert reader beyond the easier parts of this book. I might note also that some of the literature is confused and some of it is just plain wrong.

By this sort of talk I may have raised wonder in the reader's mind as to whether or not information theory is really worth so much trouble, either on his part, for that matter, or on mine. I can only say that to the degree that the whole world of science and technology around us is important, information theory is important, for it is an important part of that world. To the degree to which an intelligent reader wants to know something both about that world and about information theory, it is worth his while to try to get a clear picture. Such a picture must show information theory neither as something utterly alien and unintelligible nor as something that can be epitomized in a few easy words and appreciated without effort.

The process of writing this book was not easy. Of course it could never have been written at all but for the work of Claude Shannon, who, besides inspiring the book through his work, read the original manuscript and suggested several valuable changes. … jolted me out of the rut of error and confusion in an even more vigorous way. E. N. Gilbert deflected me from error in several instances.
Milton Babbitt reassured me concerning the major contents of the chapter on information theory and art and suggested a few changes. P. D. Bricker, H. M. Jenkins, and R. N. Shepard advised me in the field of psychology, but the views I finally expressed should not be attributed to them. The help of M. V. Mathews was invaluable. Benoit Mandelbrot helped me with Chapter XII. J. P. Runyon read the manuscript with care, and Eric Wolman uncovered an appalling number of textual errors, and made valuable suggestions as well. I am also indebted to Prof. Martin Harwit, who persuaded me and Dover that the book was worth reissuing. The reader is indebted to James R. Newman for the fact that I have provided a glossary, summaries at the ends of some chapters, and for my final attempts to make some difficult points a little clearer. To all of these I am indebted and not less to Miss F. M. Costello, who triumphed over the chaos of preparing and correcting the manuscript and figures. In preparing this new edition, I owe much to my secretary, Mrs. Patricia J. Neill.

September, 1979
J. R. Pierce

CHAPTER I
The World and Theories

In 1948, Claude E. Shannon published a paper called "A Mathematical Theory of Communication"; it appeared in book form in 1949. Before that time, a few isolated workers had from time to time taken steps toward a general theory of communication. Now, thirty years later, communication theory, or information theory as it is sometimes called, is an accepted field of research. Many books on communication theory have been published, and many international symposia and conferences have been held. The Institute of Electrical and Electronics Engineers has a professional group on information theory, whose Transactions appear six times a year. Many other journals publish papers on information theory.

All of us use the words communication and information, and we are unlikely to underestimate their importance. A modern philosopher, A. J. Ayer, has commented on the wide meaning and importance of communication in our lives. We communicate, he observes, not only information, but also knowledge, error, opinions, ideas, experiences, wishes, orders, emotions, feelings, moods. Heat and motion can be communicated. So can strength and weakness and disease. He cites other examples and comments on the manifold manifestations and puzzling features of communication in man's world.

Surely, communication being so various and so important, a theory of communication, a theory of generally accepted soundness and usefulness, must be of incomparable importance to all of us. When we add to theory the word mathematical, with all its implications of rigor and magic, the attraction becomes almost irresistible. Perhaps if we learn a few formulae our problems of communication will be solved, and we shall become the masters of information rather than the slaves of misinformation.

Unhappily, this is not the course of science. Some 2,300 years ago, another philosopher, Aristotle, discussed in his Physics a notion as universal as that of communication, that is, motion.

Aristotle defined motion as the fulfillment, insofar as it exists potentially, of that which exists potentially. He included in the concept of motion the increase and decrease of that which can be increased or decreased, coming to and passing away, and also being built. He spoke of three categories of motion, with respect to magnitude, affection, and place. He found, indeed, as he said, as many types of motion as there are meanings of the word is.

Here we see motion in all its manifest complexity. The complexity is perhaps a little bewildering to us, for the associations of words differ in different languages, and we would not necessarily associate motion with all the changes of which Aristotle speaks.

How puzzling this universal matter of motion must have been to the followers of Aristotle. It remained puzzling for over two millennia, until Newton enunciated the laws which engineers still use in designing machines and astronomers in studying the motions of stars, planets, and satellites. While later physicists have found that Newton's laws are only the special forms which more general laws assume when velocities are small compared with that of light and when the scale of the phenomena is large compared with the atom, they are a living part of our physics rather than a historical monument.

Surely, when motion is so important a part of our world, we should study Newton's laws of motion. They say:

1. A body continues at rest or in motion with a constant velocity in a straight line unless acted upon by a force.

2. The change in velocity of a body is in the direction of the force acting on it, and the magnitude of the change is proportional to the force acting on the body times the time during which the force acts, and is inversely proportional to the mass of the body.

3. Whenever a first body exerts a force on a second body, the second body exerts an equal and oppositely directed force on the first body.

To these laws Newton added the universal law of gravitation:

4. Two particles of matter attract one another with a force acting along the line connecting them, a force which is proportional to the product of the masses of the particles and inversely proportional to the square of the distance separating them.

Newton's laws brought about a scientific and a philosophical revolution. Using them, Laplace reduced the solar system to an explicable machine. They have formed the basis of aviation and rocketry, as well as of astronomy. Yet, they do little to answer many of the questions about motion which Aristotle considered. Newton's laws solved the problem of motion as Newton defined it, not of motion in all the senses in which the word could be used in the Greek of the fourth century before our Lord or in the English of the twentieth century after.
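In the compact notation of later physics (not used in Pierce's prose), the second law and the law of gravitation read

    \mathbf{F} = m\mathbf{a}, \qquad F = \frac{G\, m_1 m_2}{r^2},

where F is force, m is mass, a is acceleration, G is the gravitational constant, and r is the distance between two particles of masses m1 and m2. The first law is the special case in which the force is zero, so that the velocity does not change.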
Our speech is adapted to our daily needs or, perhaps, to the needs of our ancestors. We cannot have a separate word for every distinct object and for every distinct event; if we did we should be forever coining words, and communication would be impossible. In order to have language at all, many things or many events must be referred to by one word. It is natural to say that both men and horses run (though we may prefer to say that horses gallop) and convenient to say that a motor runs and to speak of a run in a stocking or a run on a bank.

The unity among these concepts lies far more in our human language than in any physical similarity with which we can expect science to deal easily and exactly. It would be foolish to seek some elegant, simple, and useful scientific theory of running which would embrace runs of salmon and runs in hose. It would be equally foolish to try to embrace in one theory all the motions discussed by Aristotle or all the sorts of communication and information which later philosophers have discovered.

In our everyday language, we use words in a way which is convenient in our everyday business. Except in the study of language itself, science does not seek understanding by studying words and their relations. Rather, science looks for things in nature, including our human nature and activities, which can be grouped together and understood. Such understanding is an ability to see what complicated or diverse events really do have in common (the planets in the heavens and the motions of a whirling skater on ice, for instance) and to describe the behavior accurately and simply.

The words used in such scientific descriptions are often drawn from our everyday vocabulary. Newton used force, mass, velocity, and attraction. When used in science, however, a particular meaning is given to such words, a meaning narrow and often new. We cannot discuss in Newton's terms force of circumstance or the attraction of Brigitte Bardot. Neither should we expect that communication theory will have something sensible to say about every question we can phrase using the words communication or information.

A valid scientific theory seldom if ever offers the solution to the pressing problems which we repeatedly state. It seldom supplies a sensible answer to our multitudinous questions. Rather than rationalizing our ideas, it discards them entirely, or, rather, it leaves them as they were. It tells us in a fresh and new way what aspects of our experience can profitably be related and simply understood. In this book, it will be our endeavor to seek out the ideas concerning communication which can be so related and understood.

When the portions of our experience which can be related have been singled out, and when they have been related and understood, we have a theory concerning these matters. Newton's laws of motion form an important part of theoretical physics, a field called mechanics. The laws themselves are not the whole of the theory; they are merely the basis of it, as the axioms or postulates of geometry are the basis of geometry. The theory embraces both the assumptions themselves and the mathematical working out of the logical consequences which must necessarily follow from the assumptions. Of course, these consequences must be in accord with the complex phenomena of the world about us if the theory is to be a valid theory, and an invalid theory is useless.
The ideas and assumptions of a theory determine the generality of the theory, that is, to how wide a range of phenomena the theory applies. Thus, Newton's laws of motion and of gravitation are very general; they explain the motion of the planets, the timekeeping properties of a pendulum, and the behavior of all sorts of machines and mechanisms. They do not, however, explain radio waves.

Maxwell's equations¹ explain all (non-quantum) electrical phenomena; they are very general. A branch of electrical theory called network theory deals with the electrical properties of electrical circuits, or networks, made by interconnecting three sorts of idealized electrical structures: resistors (devices such as coils of thin, poorly conducting wire or films of metal or carbon, which impede the flow of current), inductors (coils of copper wire, sometimes wound on magnetic cores), and capacitors (thin sheets of metal separated by an insulator or dielectric such as mica or plastic; the Leyden jar was an early form of capacitor). Because network theory deals only with the electrical behavior of certain specialized and idealized physical structures, while Maxwell's equations describe the electrical behavior of any physical structure, a physicist would say that network theory is less general than are Maxwell's equations, for Maxwell's equations cover the behavior not only of idealized electrical networks but of all physical structures and include the behavior of radio waves, which lies outside of the scope of network theory.

Certainly, the most general theory, which explains the greatest range of phenomena, is the most powerful and the best; it can always be specialized to deal with simple cases. That is why physicists have sought a unified field theory to embrace mechanical laws and gravitation and all electrical phenomena. It might, indeed, seem that all theories could be ranked in order of generality, and, if this is possible, we should certainly like to know the place of communication theory in such a hierarchy.

Unfortunately, life isn't as simple as this.

¹ In 1873, in his treatise Electricity and Magnetism, James Clerk Maxwell presented and fully explained for the first time the natural laws relating electric and magnetic fields and electric currents. He showed that there should be electromagnetic waves (radio waves) which travel with the speed of light. Hertz later demonstrated these experimentally, and we now know that light is electromagnetic waves. Maxwell's equations are the mathematical statement of Maxwell's theory of electricity and magnetism. They are the foundation of all electric art.

In one sense, network theory is less general than Maxwell's equations. In another sense, however, it is more general, for all the mathematical results of network theory hold for vibrating mechanical systems made up of idealized mechanical components as well as for the behavior of interconnections of idealized electrical components. In mechanical applications, a spring corresponds to a capacitor, a mass to an inductor, and a dashpot or damper, such as that used in a door closer to keep the door from slamming, corresponds to a resistor. In fact, network theory might have been developed to explain the behavior of mechanical systems, and it is so used in the field of acoustics. The fact that network theory evolved from the study of idealized electrical systems rather than from the study of idealized mechanical systems is a matter of history, not of necessity.

Because all of the mathematical results of network theory apply to certain specialized and idealized mechanical systems, as well as to certain specialized and idealized electrical systems, we can say that in a sense network theory is more general than Maxwell's equations, which do not apply to mechanical systems at all. In another sense, of course, Maxwell's equations are more general than network theory, for Maxwell's equations apply to all electrical systems, not merely to a specialized and idealized class of electrical circuits.

To some degree we must simply admit that this is so, without being able to explain the fact fully. Yet, we can say this much. Some theories are very strongly physical theories. Newton's laws and Maxwell's equations are such theories. Newton's laws deal with mechanical phenomena; Maxwell's equations deal with electrical phenomena. Network theory is essentially a mathematical theory. The terms used in it can be given various physical meanings. The theory has interesting things to say about different physical phenomena, about mechanical as well as electrical vibrations.

Often a mathematical theory is the offshoot of a physical theory or of physical theories. It can be an elegant mathematical formulation and treatment of certain aspects of a general physical theory. Network theory is such a treatment of certain physical behavior common to electrical and mechanical devices. A branch of mathematics called potential theory treats problems common to electric, magnetic, and gravitational fields and, indeed, in a degree to aerodynamics. Some theories seem, however, to be more mathematical than physical in their very inception.

We use many such mathematical theories in dealing with the physical world. Arithmetic is one of these. If we label one of a group of apples, dogs, or men 1, another 2, and so on, and if we have used up just the first 16 numbers when we have labeled all members of the group, we feel confident that the group of objects can be divided into two equal groups each containing 8 objects (16 ÷ 2 = 8) or that the objects can be arranged in a square array of four parallel rows of four objects each (because 16 is a perfect square; 16 = 4 × 4). Further, if we line the apples, dogs, or men up in a row, there are 20,922,789,888,000 possible sequences in which they can be arranged, corresponding to the 20,922,789,888,000 different sequences of the integers 1 through 16. If we used up 13 rather than 16 numbers in labeling the complete collection of objects, we feel equally certain that the collection could not be divided into any number of equal heaps, because 13 is a prime number and cannot be expressed as a product of factors. This seems not to depend at all on the nature of the objects.
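These particular facts are easy to check mechanically. A small sketch in Python (mine, not the book's) confirms the numbers quoted above:

    import math

    # Sixteen labeled objects can be split into two equal groups of eight
    # and arranged in a square array of four rows of four.
    assert 16 // 2 == 8 and 16 % 2 == 0
    assert 4 * 4 == 16

    # The number of distinct orders in which sixteen objects can be lined up
    # is 16 factorial.
    print(math.factorial(16))                         # 20922789888000

    # Thirteen objects cannot be divided into equal heaps of any size,
    # because 13 has no divisor other than 1 and itself.
    print([d for d in range(2, 13) if 13 % d == 0])   # prints [], since 13 is prime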
Insofar as we can assign numbers to the members of any collection of objects, the results we get by adding, subtracting, multiplying, and dividing numbers or by arranging the numbers in sequence hold true. The connection between numbers and collections of objects seems so natural to us that we may overlook the fact that arithmetic is itself a mathematical theory which can be applied to nature only to the degree that the properties of numbers correspond to properties of the physical world.

Physicists tell us that we can talk sense about the total number of a group of elementary particles, such as electrons, but we can't assign particular numbers to particular particles because the particles are in a very real sense indistinguishable. Thus, we can't talk about arranging such particles in different orders, as numbers can be arranged in different sequences. This has important consequences in a part of physics called statistical mechanics. We may also note that while Euclidean geometry is a mathematical theory which serves surveyors and navigators admirably in their practical concerns, there is reason to believe that Euclidean geometry is not quite accurate in describing astronomical phenomena.

How can we describe or classify theories? We can say that a theory is very narrow or very general in its scope. We can also distinguish theories as to whether they are strongly physical or strongly mathematical.

Theories are strongly physical when they describe very completely some range of physical phenomena, which in practice is always limited. Theories become more mathematical or abstract when they deal with an idealized class of phenomena or with only certain aspects of phenomena. Newton's laws are strongly physical in that they afford a complete description of mechanical phenomena such as the motions of the planets or the behavior of a pendulum. Network theory is more toward the mathematical or abstract side in that it is useful in dealing with a variety of idealized physical phenomena. Arithmetic is very mathematical and abstract; it is equally at home with one particular property of many sorts of physical entities, with numbers of dogs, numbers of men, and (if we remember that electrons are indistinguishable) with numbers of electrons. It is even useful in reckoning numbers of days.

In these terms, communication theory is both very strongly mathematical and quite general. Although communication theory grew out of the study of electrical communication, it attacks problems in a very abstract and general way. It provides, in the bit, a universal measure of amount of information in terms of choice or uncertainty. Specifying or learning the choice between two equally probable alternatives, which might be messages or numbers to be transmitted, involves one bit of information. Communication theory tells us how many bits of information can be sent per second over perfect and imperfect communication channels in terms of rather abstract descriptions of the properties of these channels. Communication theory tells us how to measure the rate at which a message source, such as a speaker or a writer, generates information. Communication theory tells us how to represent, or encode, messages from a particular message source efficiently for transmission over a particular sort of channel, such as an electrical circuit, and it tells us when we can avoid errors in transmission.

Because communication theory discusses such matters in very general and abstract terms, it is sometimes difficult to use the understanding it gives us in connection with particular, practical problems. However, because communication theory has such an abstract and general mathematical form, it has a very broad field of application. Communication theory is useful in connection with written and spoken language, the electrical and mechanical transmission of messages, the behavior of machines, and, perhaps, the behavior of people. Some feel that it has great relevance and importance to physics in a way that we shall discuss much later in this book.

Primarily, however, communication theory is, as Shannon described it, a mathematical theory of communication. The concepts are formulated in mathematical terms, of which widely different physical examples can be given. Engineers, psychologists, and physicists may use communication theory, but it remains a mathematical theory rather than a physical or psychological theory or an engineering art.

It is not easy to present a mathematical theory to a general audience, yet communication theory is a mathematical theory, and to pretend that one can discuss it while avoiding mathematics entirely would be ridiculous. Indeed, the reader may be startled to find equations and formulae in these pages; these state accurately ideas which are also described in words, and I have included an appendix on mathematical notation to help the nonmathematical reader who wants to read the equations aright.
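A small illustration of the kind of formula that will appear (the illustration is mine, not a statement from the text): if a message is a choice among N equally likely alternatives, the amount of information it conveys, measured in bits, is the logarithm of N to the base two,

    H = \log_2 N,

so that two equally probable alternatives give \log_2 2 = 1 bit, as stated above, four alternatives give 2 bits, eight give 3 bits, and so on.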
I am aware, however, that mathematics calls up chiefly unpleasant pictures of multiplication, division, and perhaps square roots, as well as the possibly traumatic experiences of high-school classrooms. This view of mathematics is very misleading, for it places emphasis on special notation and on tricks of manipulation, rather than on the aspect of mathematics that is most important to mathematicians. Perhaps the reader has encountered theorems and proofs in geometry; perhaps he has not encountered them at all, yet theorems and proofs are of primary importance in all mathematics, pure and applied. The important results of information theory are stated in the form of mathematical theorems, and these are theorems only because it is possible to prove that they are true statements.

Mathematicians start out with certain assumptions and definitions, and then by means of mathematical arguments or proofs they are able to show that certain statements or theorems are true. This is what Shannon accomplished in his "Mathematical Theory of Communication." The truth of a theorem depends on the validity of the assumptions made and on the validity of the argument or proof which is used to establish it.

All of this is pretty abstract. The best way to give some idea of the meaning of theorem and proof is certainly by means of examples. I cannot do this by asking the general reader to grapple, one by one and in all their gory detail, with the difficult theorems of communication theory. Really to understand thoroughly the proofs of such theorems takes time and concentration even for one with some mathematical background. At best, we can try to get at the content, meaning, and importance of the theorems.

The expedient I propose to resort to is to give some examples of simpler mathematical theorems and their proof. The first example concerns a game called hex, or Nash. The theorem which will be proved is that the player with first move can win.

Hex is played on a board which is an array of forty-nine hexagonal cells or spaces, as shown in Figure I-1, into which markers may be put. One player uses black markers and tries to place them so as to form a continuous, if wandering, path between the black area at the left and the black area at the right. The other player uses white markers and tries to place them so as to form a continuous, if wandering, path between the white area at the top and the white area at the bottom. The players play alternately, each placing one marker per play. Of course, one player has to start first.

In order to prove that the first player can win, it is necessary first to prove that when the game is played out, so that there is either a black or a white marker in each cell, one of the players must have won.

Theorem I: Either one player or the other wins.

Discussion: In playing some games, such as chess and ticktacktoe, it may be that neither player will win, that is, that the game will end in a draw. In matching heads or tails, one or the other necessarily wins. What one must show to prove this theorem is that, when each cell of the hex board is covered by either a black or a white marker, either there must be a black path between the black areas which will interrupt any possible white path between the white areas or there must be a white path between the white areas which will interrupt any possible black path between the black areas, so that either white or black must have won.

Proof: Assume that each hexagon has been filled in with either a black or a white marker.
Let us start from the left-hand corner of the upper white border, point I of Figure I-2, and trace out the boundary between white and black hexagons or borders. We will proceed always along a side with black on our right and white on our left. The boundary so traced out will turn at the successive corners, or vertices, at which the sides of hexagons meet. At a corner, or vertex, we can have only two essentially different conditions. Either there will be two touching black hexagons on the right and one white hexagon on the left, as in a of Figure I-3, or two touching white hexagons on the left and one black hexagon on the right, as shown in b of Figure I-3. We note that in either case there will be a continuous black path to the right of the boundary and a continuous white path to the left of the boundary. We also note that in neither a nor b of Figure I-3 can the boundary cross or join itself, because only one path through the vertex has black on the right and white on the left. We can see that these two facts are true for boundaries between the black and white borders and hexagons as well as for boundaries between black and white hexagons. Thus, along the left side of the boundary there must be a continuous path of white hexagons to the upper white border, and along the right side of the boundary there must be a continuous path of black hexagons to the left black border. As the boundary cannot cross itself, it cannot circle indefinitely, but must eventually reach a black border or a white border. If the boundary reaches a black border or white border with black on its right and white on its left, as we have prescribed, at any place except corner II or corner III, we can extend the boundary further with black on its right and white on its left. Hence, the boundary will reach either point II or point III. If it reaches point II, as shown in Figure I-2, the black hexagons on the right, which are connected to the left black border, will also be connected to the right black border, while the white hexagons to the left will be connected to the upper white border only, and black will have won. It is clearly impossible for white to have won also, for the continuous band of adjacent black cells from the left border to the right precludes a continuous band of white cells to the bottom border. We see by similar argument that, if the boundary reaches point III, white will have won.
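Theorem I can also be illustrated by experiment. The following Python sketch is mine, not the book's; it assumes a 7-by-7 rhombus of hexagonal cells like that of Figure I-1, with black trying to join the left and right edges and white the top and bottom. It fills boards at random and checks that every completely filled board yields exactly one winner:

    import random

    N = 7  # a board of forty-nine hexagonal cells, as in Figure I-1

    # Neighbors of cell (row, col) on a rhombus-shaped hex board.
    NEIGHBORS = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, 1), (1, -1)]

    def connected(color, board, starts, reached_goal):
        """Is some cell in `starts` joined to a goal cell by cells of `color`?"""
        stack = [cell for cell in starts if board[cell] == color]
        seen = set(stack)
        while stack:
            r, c = stack.pop()
            if reached_goal((r, c)):
                return True
            for dr, dc in NEIGHBORS:
                nxt = (r + dr, c + dc)
                if nxt in board and nxt not in seen and board[nxt] == color:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    for trial in range(1000):
        board = {(r, c): random.choice("BW") for r in range(N) for c in range(N)}
        black_wins = connected("B", board, [(r, 0) for r in range(N)],
                               lambda cell: cell[1] == N - 1)   # left edge to right edge
        white_wins = connected("W", board, [(0, c) for c in range(N)],
                               lambda cell: cell[0] == N - 1)   # top edge to bottom edge
        assert black_wins != white_wins   # exactly one player has won

    print("In 1000 randomly filled boards, exactly one player won every time.")

Such trials prove nothing, of course; they merely exhibit what the boundary argument above guarantees for every possible filled board.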

Theorem II: The player with the first move can win.

Discussion: By can is meant that there exists a way, if only the player were wise enough to know it. The method for winning would consist of a particular first move (more than one might be allowable but are not necessary) and a chart, formula, or other specification or recipe giving a correct move following any possible move made by his opponent at any subsequent stage of the game, such that if, each time he plays, the first player makes the prescribed move, he will win regardless of what moves his opponent may make.

Proof: Either there must be some way of play which, if followed by the first player, will insure that he wins or else, no matter how the first player plays, the second player must be able to choose moves which will preclude the first player from winning, so that he, the second player, will win. Let us assume that the player with the second move does have a sure recipe for winning. Let the player with the first move make his first move in any way, and then, after his opponent has made one move, let the player with the first move apply the hypothetical recipe which is supposed to allow the player with the second move to win. If at any time a move calls for putting a piece on a hexagon occupied by a piece he has already played, let him place his piece instead on any unoccupied space. The designated space will thus be occupied. The fact that by starting first he has an extra piece on the board may keep his opponent from occupying a particular hexagon but not the player with the extra piece. Hence, the first player can occupy the hexagons designated by the recipe and must win. This is contrary to the original assumption that the player with the second move can win, and so this assumption must be false. Instead, it must be possible for the player with the first move to win.

A mathematical purist would scarcely regard these proofs as rigorous in the form given. The proof of theorem II has another curious feature; it is not a constructive proof. That is, it does not show the player with the first move, who can win in principle, how to go about winning. We will come to an example of a constructive proof in a moment. First, however, it may be appropriate to philosophize a little concerning the nature of theorems and the need for proving them.

Mathematical theorems are inherent in the rigorous statement of the general problem or field. That the player with the first move can win at hex is necessarily so once the game and its rules of play have been specified. The theorems of Euclidean geometry are necessarily so because of the stated postulates.

With sufficient intelligence and insight, we could presumably see the truth of theorems immediately. The young Newton is said to have found Euclid's theorems obvious and to have been impatient with their proofs. Ordinarily, while mathematicians may suspect or conjecture the truth of certain statements, they have to prove theorems in order to be certain. Newton himself came to see the importance of proof, and he proved many new theorems by using the methods of Euclid. By and large, mathematicians have to proceed step by step in attaining sure knowledge of a problem. They laboriously prove one theorem after another, rather than seeing through everything in a flash.
Too, they need to prove the theorems in order to convince others. Sometimes a mathematician needs to prove a theorem to convince himself, for the theorem may seem contrary to common sense. Let us take the following problem as an example:

Consider the square, 1 inch on a side, at the left of Figure I-4. We can specify any point in the square by giving two numbers, y, the height of the point above the base of the square, and x, the distance of the point from the left-hand side of the square. Each of these numbers will be less than one. For instance, the point shown will be represented by

x = 0.547000 . . . (ending in an endless sequence of zeros)
y = 0.312000 . . . (ending in an endless sequence of zeros)

Suppose we pair up points on the square with points on the line, so that every point on the line is paired with just one point on the square and every point on the square with just one point on the line. If we do this, we are said to have mapped the square onto the line in a one-to-one way, or to have achieved a one-to-one mapping of the square onto the line.

Fig. I-4

Theorem: It is possible to map a square of unit area onto a line of unit length in a one-to-one way.²

Proof: Take the successive digits of the height of the point in the square and let them form the first, third, fifth, and so on digits of a number x'. Take the digits of the distance of the point P from the left side of the square, and let these be the second, fourth, sixth, etc., of the digits of the number x'. Let x' be the distance of the point P' from the left-hand end of the line. Then the point P' maps the point P of the square onto the line uniquely, in a one-to-one way. We see that changing either x or y will change x' to a new and appropriate number, and changing x' will change x and y. To each point x, y in the square corresponds just one point x' on the line, and to each point x' on the line corresponds just one point x, y in the square, the requirement for one-to-one mapping.³

In the case of the example given before

x = 0.547000 . . .
y = 0.312000 . . .
x' = 0.351427000000 . . .

In the case of most points, including those specified by irrational numbers, the endless string of digits representing the point will not become a sequence of zeros nor will it ever repeat.

Here we have an example of a constructive proof. We show that we can map each point of a square into a point on a line segment in a one-to-one way by giving an explicit recipe for doing this.

² This has been restricted for convenience; the size doesn't matter.
³ This proof runs into resolvable difficulties in the case of some numbers, such as ½, which can be represented decimally as .5 followed by an infinite sequence of zeros or as .4 followed by an infinite sequence of nines.
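The recipe in the proof is mechanical enough to hand to a computer. A small Python sketch (mine, not the book's), working to a fixed number of decimal digits, reproduces the example just given:

    def square_to_line(x, y, digits=6):
        """Interleave the decimal digits of y and x: y's digits go in the
        first, third, fifth, ... places of x', and x's digits in the
        second, fourth, sixth, ... places."""
        x_digits = f"{x:.{digits}f}".split(".")[1]
        y_digits = f"{y:.{digits}f}".split(".")[1]
        interleaved = "".join(a + b for a, b in zip(y_digits, x_digits))
        return float("0." + interleaved)

    print(square_to_line(0.547, 0.312))   # 0.351427, as in the example above

The sketch handles only finitely many digits; the proof itself speaks of the entire endless decimal expansions, and footnote 3 notes the care needed with expansions that end in repeated nines.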

Many mathematicians prefer constructive proofs to proofs which are not constructive, and mathematicians of the intuitionist school reject nonconstructive proofs in dealing with infinite sets, in which it is impossible to examine all the members individually for the property in question.

Let us now consider another matter concerning the mapping of the points of a square on a line segment. Imagine that we move a pointer along the line, and imagine a pointer simultaneously moving over the face of the square so as to point out the points in the square corresponding to the points that the first pointer indicates on the line. We might imagine (contrary to what we shall prove) the following: If we moved the first pointer slowly and smoothly along the line, the second pointer would move slowly and smoothly over the face of the square. All the points lying in a small cluster on the line would be represented by points lying in a small cluster on the face of the square. If we moved the pointer a short distance along the line, the other pointer would move a short distance over the face of the square, and if we moved the pointer a shorter distance along the line, the other pointer would move a shorter distance across the face of the square, and so on. If this were true we could say that the one-to-one mapping of the points of the square into points on the line was continuous.

However, it turns out that a one-to-one mapping of the points in a square into the points on a line cannot be continuous. As we move smoothly along a curve through the square, the points on the line which represent the successive points on the square necessarily jump around erratically, not only for the mapping described above but for any one-to-one mapping whatever. Any one-to-one mapping of the square onto the line is discontinuous.

Theorem: Any one-to-one mapping of a square onto a line must be discontinuous.

Proof: Assume that the one-to-one mapping is continuous. If this is to be so, then all the points along some arbitrary curve AB of Figure I-5 on the square must map into the points lying between the corresponding points A' and B'. If they did not, in moving along the curve in the square we would either jump from one end of the line to the other (discontinuous mapping) or pass through one point on the line twice (not one-to-one mapping). Let us now choose a point C' to the left of line segment A'B' and D' to the right of A'B' and locate the corresponding points C and D in the square. Draw a curve connecting C and D and crossing the curve from A to B. Where the curve crosses the curve AB it will have a point in common with AB; hence, this one point of CD must map into a point lying between A' and B', and all other points which are not on AB must map to points lying outside of A'B', either to the left or the right of A'B'. This is contrary to our assumption that the mapping was continuous, and so the mapping cannot be continuous.

We shall find that these theorems, that the points of a square can be mapped onto a line and that the mapping is necessarily discontinuous, are both important in communication theory, so we have proved one theorem which, unlike those concerning hex, will be of some use to us.

Mathematics is a way of finding out, step by step, facts which are inherent in the statement of the problem but which are not immediately obvious. Usually, in applying mathematics one must first hit on the facts and then verify them by proof. Here we come upon a knotty problem, for the proofs which satisfied mathematicians of an earlier day do not satisfy modern mathematicians. In our own day, an irascible minor mathematician who reviewed Shannon's original paper on communication theory expressed doubts as to whether or not the author's mathematical intentions were honorable. Shannon's theorems are true, however, and proofs have been given which satisfy even rigor-crazed mathematicians. The simple proofs which I have given above as illustrations of mathematics are open to criticism by purists.

What I have tried to do is to indicate the nature of mathematical reasoning, to give some idea of what a theorem is and of how it may be proved. With this in mind, we will go on to the mathematical theory of communication, its theorems, which we shall not really prove, and to some implications and associations which extend beyond anything that we can establish with mathematical certainty.

As I have indicated earlier in this chapter, communication theory as Shannon has given it to us deals in a very broad and abstract way with certain important problems of communication and information, but it cannot be applied to all problems which we can phrase using the words communication and information in their many popular senses. Communication theory deals with certain aspects of communication which can be associated and organized in a useful and fruitful way, just as Newton's laws of motion deal with mechanical motion only, rather than with all the named and indeed different phenomena which Aristotle had in mind when he used the word motion.

To succeed, science must attempt the possible. We have no reason to believe that we can unify all the things and concepts for which we use a common word. Rather we must seek that part of experience which can be related. When we have succeeded in relating certain aspects of experience we have a theory. Newton's laws of motion are a theory which we can use in dealing with mechanical phenomena. Maxwell's equations are a theory which we can use in connection with electrical phenomena. Network theory we can use in connection with certain simple sorts of electrical or mechanical devices. We can use arithmetic very generally in connection with numbers of men, stones, or stars, and geometry in measuring land, sea, or galaxies.
Unlike Newton's laws of motion and Maxwell's equations, which are strongly physical in that they deal with certain classes of physical phenomena, communication theory is abstract in that it applies to many sorts of communication, written, acoustical, or electrical. Communication theory deals with certain important but abstract aspects of communication. Communication theory proceeds from clear and definite assumptions to theorems concerning information sources and communication channels. In this it is essentially mathematical, and in order to understand it we must understand the idea of a theorem as a statement which must be proved, that is, which must be shown to be the necessary consequence of a set of initial assumptions. This is an idea which is the very heart of mathematics as mathematicians understand it.

CHAPTER II
The Origins of Information Theory

Men have been at odds concerning the value of history. Some have studied earlier times in order to find a universal system of the world, in whose inevitable unfolding we can see the future as well as the past. Others have sought in the past prescriptions for success in the present. Thus, some believe that by studying scientific discovery in another day we can learn how to make discoveries. On the other hand, one sage observed that we learn nothing from history except that we never learn anything from history, and Henry Ford asserted that history is bunk.

All of this is as far beyond me as it is beyond the scope of this book. I will, however, maintain that we can learn at least two things from the history of science.

One of these is that many of the most general and powerful discoveries of science have arisen, not through the study of phenomena as they occur in nature, but, rather, through the study of phenomena in man-made devices, in products of technology, if you will. This is because the phenomena in man's machines are simplified and ordered in comparison with those occurring naturally, and it is these simplified phenomena that man understands most easily.

Thus, the existence of the steam engine, in which phenomena involving heat, pressure, vaporization, and condensation occur in a simple and orderly fashion, gave tremendous impetus to the very powerful and general science of thermodynamics. We see this especially in the work of Carnot.¹

Our knowledge of aerodynamics and hydrodynamics exists chiefly because airplanes and ships exist, not because of the existence of birds and fishes. Our knowledge of electricity came mainly not from the study of lightning, but from the study of man's artifacts.

Similarly, we shall find the roots of Shannon's broad and elegant theory of communication in the simplified and seemingly easily intelligible phenomena of telegraphy.

The second thing that history can teach us is with what difficulty understanding is won. Today, Newton's laws of motion seem simple and almost inevitable, yet there was a day when they were undreamed of, a day when brilliant men had the oddest notions about motion. Even discoverers themselves sometimes seem incredibly dense as well as inexplicably wonderful. One might expect of Maxwell's treatise on electricity and magnetism a bold and simple pronouncement concerning the great step he had taken. Instead, it is cluttered with all sorts of such lesser matters as once seemed important, so that a naive reader might search long to find the novel step and to restate it in the simple manner familiar to us. It is true, however, that Maxwell stated his case clearly elsewhere.

Thus, a study of the origins of scientific ideas can help us to value understanding more highly for its having been so dearly won. We can often see men of an earlier day stumbling along the edge of discovery but unable to take the final step. Sometimes we are tempted to take it for them and to say, because they stated many of the required concepts in juxtaposition, that they must really have reached the general conclusion. This, alas, is the same trap into which many an ungrateful fellow falls in his own life. When someone actually solves a problem that he merely has had ideas about, he believes that he understood the matter all along.

Properly understood, then, the origins of an idea can help to show what its real content is; what the degree of understanding was before the idea came along and how unity and clarity have been attained. But to attain such understanding we must trace the actual course of discovery, not some course which we feel discovery should or could have taken, and we must see problems (if we can) as the men of the past saw them, not as we see them today.

¹ N. L. S. Carnot (1796-1832) first proposed an ideal expansion of gas (the Carnot cycle) which will extract the maximum possible mechanical energy from the thermal energy of the steam.

In looking for the origin of communication theory one is apt to fall into an almost trackless morass. I would gladly avoid this entirely but cannot, for others continually urge their readers to enter it. I only hope that they will emerge unharmed with the help of the following grudgingly given guidance.

A particular quantity called entropy is used in thermodynamics and in statistical mechanics. A quantity called entropy is used in communication theory. After all, thermodynamics and statistical mechanics are older than communication theory. Further, in a paper published in 1929, L. Szilard, a physicist, used an idea of information in resolving a particular physical paradox. From these facts we might conclude that communication theory somehow grew out of statistical mechanics.

This easy but misleading idea has caused a great deal of confusion even among technical men. Actually, communication theory evolved from an effort to solve certain problems in the field of electrical communication. Its measure of information was called entropy by mathematical analogy with the entropy of statistical mechanics. The chief relevance of this entropy is to problems quite different from those which statistical mechanics attacks.

In thermodynamics, the entropy of a body of gas depends on its temperature, volume, and mass (and on what gas it is), just as the energy of the body of gas does. If the gas is allowed to expand in a cylinder, pushing on a slowly moving piston, with no flow of heat to or from the gas, the gas will become cooler, losing some of its thermal energy. This energy appears as work done on the piston. The work may, for instance, lift a weight, which thus stores the energy lost by the gas.

This is a reversible process. By this we mean that if work is done in pushing the piston slowly back against the gas and so recompressing it to its original volume, the exact original energy, pressure, and temperature will be restored to the gas. In such a reversible process, the entropy of the gas remains constant, while its energy changes.

Thus, entropy is an indicator of reversibility; when there is no change of entropy, the process is reversible. In the example discussed above, energy can be transferred repeatedly back and forth between thermal energy of the compressed gas and mechanical energy of a lifted weight.

Most physical phenomena are not reversible. Irreversible phenomena always involve an increase of entropy. Imagine, for instance, that a cylinder which allows no heat flow in or out is divided into two parts by a partition, and suppose that there is gas on one side of the partition and none on the other. Imagine that the partition suddenly vanishes, so that the gas expands and fills the whole container. In this case, the thermal energy remains the same, but the entropy increases.

Before the partition vanished we could have obtained mechanical energy from the gas by letting it flow into the empty part of the cylinder through a little engine. After the removal of the partition and the subsequent increase in entropy, we cannot do this.
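A standard result of elementary thermodynamics, not derived in the book, attaches a number to this picture: when n moles of an ideal gas expand freely from volume V1 to volume V2 with no change in thermal energy, the entropy increases by

    \Delta S = nR \ln \frac{V_2}{V_1},

where R is the gas constant. If the two parts of the cylinder are of equal size, the vanishing of the partition doubles the volume, and the entropy rises by nR ln 2 while the energy stays the same.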
The entropy can increase while the energy remains constant in other similar circumstances.For instance, this happens when heat flows from a hot object to a cold object. Before the temperatures were equalized, mechanical work could have been done by makittg use of the temperature difference. After the temperature difference has disappeared, we can no longer use it in changitg part of the thermal energy into mechanical energy. Thus, an increase in entropy means a decreasein our ability to change thermal energy, the energy of heat, into mechanical energy. An increase of entropy means a decreaseof available energy. While thermodynamics gave us the concept of entropy, rt does not give a detailed physical picture of entropy, in terms of positions and velocities of molecules, for instance. Statistical mechanicsdoes give a detailed mechanical meaning to entropy in particular cases. In general, the meaning is that an increase in entropy means a decrease in order. But, when we ask what order means, we must in some way equate it with knowledge. Even a very complex arrangement of molecules can scarcely be disordered if we know the position and velocity of every one. Disorder in the sensein which it is used in statistical mechanics involves unpredictability based on a lack of knowledge of the positions and velocities of molecules. Ordinarily we lack such knowledge when the arrange- ment of positions and velocities is "complicated." The Origins of Information Theory 23

Let us return to the example discussed above in which all the molecules of a gas are initially on one side of a partition in a cylinder. If the molecules are all on one side of the partitior, and if we know this, the entropy is less than if they are distributed on both sides of the partition. Certainly, we know more about the positions of the molecules when we know that they are all on one side of the partition than if we merely know that they are some- where within the whole container. The more detailed our knowl- edge is concerning a , the less uncertainty we have concerning it (concerning the location of the molecules, for instance) and the less the entropy is. Conversely, more uncertainty means more entropy. Thus, in physics, entropy is associated with the possibility of converting thermal energy into mechanical energy. If the entropy does not change during a process, the process is reversible. If ttle entropy increases,the avatlable en ergy decreases.Statistical me- chanics interprets an increaseof entropy as a decreasein order or if we wish, as a decreasein our knowledge. The applications and details of entropy in physics are of course much broader than the examples I have given can illustrate, but I believe that I have indicated its nature and something of its impor- tance. Let us now consider the quite different purpose and use of the entropy of communication theory. In communication theory we consider a messagesource, such as a writer or a speaker, which may produce on a given occasion any one of many possible messages.The amount of information conveyed by the messageincreases as the amount of uncertainty as to what messageactually will be produced becomes greater. A message which is one out of ten possible messagesconveys a smaller amount of information than a message which is one out of a million possible messages.The entropy of communication theory is a measure of this uncertainty and the uncertainty, or entropy, is taken as the measure of the amount of information conveyed by u messagefrom a source. The more we know about what message the source will produce, the less uncertainty, the less the entropy, and the less the information. We see that the ideas which gave rise to the entropy of physics and the entropy of communication theory are quite clifferent. One 24 Symbols,Signals and Noise can be fully usefulwithout any referenceat all to the other.None- theless,both the entropy of statistical mechanicsand that of communication theory can be describedin terms of uncertainty, in similar mathematical terms. Can some significant and useful relation be establishedbetween the two different entropiesand, indeed, between physics and the mathematical theory of com- munication? Several physicists and mathematicians have been anxious to show that communication theory and its entropy are extremely important in connectionwith statisticalmechanics. This is still a confusedand confusirg matter. The confusionis sometimesaggra- vated when more than one meaning of informationcreeps into a discussion.Thus, informationis sometimesassociated with the idea of knowledgethrough its popular use rather than with uncertainty and the resolution of uncertainty,as it is in communicationtheory. We witl consider the relation between cofirmunication theory and physics in Chapter X, after arrivin g at someunderstanding of communication theory. Here I will merely say that the effortsto marry communicationtheory and physicshave been more interest- irg than fruitful. 
Certainly, such attempts have not produced important new results or understanding,as communicationtheory has in its own right. Communication theory has its origins in the study of electrical communication, not in statistical mechanics,and some of the ideas important to communication theory go back to the very origins of electrical communication. Durin g a transatlanticvoyage in L832,Samuel F. B. Morse set to work on the first widely successfulform of electricaltelegraph. As Morse first worked it out, his telegraphwas much more com- plicated than the one we know. It actually drew short and long lines on a strip of paper,and sequencesof theserepresented, not the letters of a word, but numbers assignedto words in a diction- ary or code book which Morse completedin 1837.This is (aswe shall see)an efficientform of coditg, but it is clumsy. While Morse was working with , the old codingwas given up, and what we now know as the had been devisedby 1838.In this code,letters of the alphabetare represented by spaces,dots, and dashes.The spaceis the absenceof an electric The Origins of Information Theory 25 current, the dot is an electriccurrent of short duration, and the dash is an electriccurrent of longer duration. Variouscombinations of dots and dasheswere cleverlyassigned to the lettersof the alphabet.E, the letter occurringmost frequently in English text, was representedby the shortest possible code symbol, & singledot, and, in general,short combinationsof dots and dasheswere used for frequently used letters and long combi- nations for rarely usedletters. Strangely enough, the choicewas not guided by tablesof the relativefrequencies of various letters in English text nor were letters in text counted to get such data. Relative frequenciesof occurrenceof various letters were estimated by counting the number of types in the various compartmentsof a printer's type box! We can ask, would someother assignmentof dots, dashes,and spacesto lettersthan that usedby Morse enableus to sendEnglish text fasterby telegraph?Our -od.tn theory tells us that we could only gain about 15 per cent in speed.Morse was very successful indeed in achievinghis end, and he had the end clearly in mind. The lessonprovided by Morse's code is that it matters profoundly how one translatesa messageinto electricalsignals. This matter is at the very heart of communication theory. In 1843, Congresspassed a bill appropriating money for the construction of a telegraphcircuit betweenWashington and Balti- more. Morse started to luy the wire underground, but ran into difficulties which later plagued submarine cables even more severely.He solved his immediate problem by stringing the wire on poles. The difficulty which Morse encounteredwith his underground wire remained an important problem. Different circuits which conduct a steadyelectric current equally well are not necessarily equally suited to electricalcommunication. If one sendsdots and dashestoo fast over an undergroundor underseacircuit, they are run together at the receivirg end. As indicated in Figure II- l, when we senda short burst of current which turns abruptly on and off, we receiveat the far end of the circuit a longer, smoothed-out rise and fall of current. This longer flow of current may overlap the current of another symbol sent,for instance,as an absenceof current. Thus, as shown in Figure II-2, when a clear and distinct 26 Symbols, Signals and Noise

Fig. II-1 (a short, abrupt burst of current as sent, and the longer, smoothed-out current received, plotted against time)

is transmitted it may be received as a vaguely wandering rise and fall of current which is difficult to interpret. Of course, if we make our dots, spaces, and dashes long enough, the current at the far end will follow the current at the sending end better, but this slows the rate of transmission. It is clear that there is somehow associated with a given transmission circuit a limiting speed of transmission for dots and spaces. For submarine cables this speed is so slow as to trouble telegraphers; for wires on poles it is so fast as not to bother telegraphers. Early telegraphists were aware of this limitation, and it, too, lies at the heart of communication theory.

Fig. II-2 (a clear and distinct signal as sent, and the vaguely wandering current received)

Even in the face of this limitation on speed, various things can be done to increase the number of letters which can be sent over a given circuit in a given period of time. A dash takes three times as long to send as a dot. It was soon appreciated that one could gain by means of double-current telegraphy. We can understand this by imagining that at the receivi.g end a galvanometer, a device which detects and indicates the direction of flow of small currents, is connected between the telegraph wire and the ground. To indicate a dot, the connects the positive terminal of his battery to the wire and the negative terminal to ground, and the needle of the galvanometer moves to the right. To send a dash, the sender connects the negative terminal of his battery to the wire and the positive terminal to the ground, and the needle of the galva- nometer moves to the left. We say that an electric current in one direction (into the wire) represents a dot and an electric current in the other direction (out of the wire) represents a dash. No current at all (battery disconnected) represents a space. In actual double-current telegraphy, a different sort of receiving instrument is used. In single-current telegraphy we have two elements out of which to construct our code: current and no current, which we might call I and 0. In double-current telegraphv we really have three elements, which we might characterize as forward current, or current into the wire; no current; backward current, or current out of the wire: or as + l, 0, - l. Here the + or sign indicates the direction of current flow and the number I gives the magnitude or strength of the current, which in this case is equal for current flow in either direction. In 1874, went further; in his quadruplex tele- graph system he used two intensities of current as well as two directions of current. He used changes in intensity, regardlessof changes in direction of current flow to send one message, and changes of direction of current flow regardless of changes in intensity, to send another message. If we assume the currents to differ equally one from the next, we might represent the four different conditions of current flow by means of which the two messagesare conveyed over the one circuit simultaneously as + 3, + l, -1, -3. The interpretation of these at the receivirg end is shown in Tieble I. 28 Symbols, Signals and Noise

TABLE I

Current        Meaning
Transmitted    Message 1    Message 2
+3             on           on
+1             off          on
-1             off          off
-3             on           off
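To see the scheme of Table I in concrete terms, here is a minimal sketch in Python of how two independent on-off messages might be packed into the four current values and recovered from them. It is purely illustrative; the pairing of values follows Table I, but nothing else here comes from Edison or from the text.

# Pairing of Table I: (message 1 on?, message 2 on?) -> current value
TO_CURRENT = {(True, True): +3, (False, True): +1, (False, False): -1, (True, False): -3}
FROM_CURRENT = {current: pair for pair, current in TO_CURRENT.items()}

def encode(msg1_on, msg2_on):
    # one instant of each message becomes a single current value on the wire
    return TO_CURRENT[(msg1_on, msg2_on)]

def decode(current):
    # the receiver recovers both messages from the one current value
    return FROM_CURRENT[current]

# Two short on-off messages sent simultaneously over one circuit:
message1 = [True, False, True, True]
message2 = [False, False, True, False]
line = [encode(a, b) for a, b in zip(message1, message2)]
print(line)                        # [-3, -1, 3, -3]
print([decode(c) for c in line])   # recovers both messages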

Figure II-3 shows how the dots, dashes,and spacesof two simultaneous,independent messages can be representedby a suc- cessionof the four different current values. Clearly, how much information it is possible to send over a circuit depends not only on how fast one can send successive symbols (successivecurrent values) over the circuit but alsoon how many different symbols(different current values)one hasavailable to chooseamong. If we have as symbolsonly the two currents+ I or 0 or, which is just as effective,the two currents + I and - l, we can convey to the receiver only one of two possibilitiesat a time. We have seenabove, however, that if we can chooseamong any one of four current values (any one of four symbols) at a

Fig. II-3 (two on-off messages and the single current, taking the values +3, +1, -1, -3, which represents both at once)

time, suchas +3 or + 1 or - I or -3, we can conveyby means of thesecuffent values(symbols) two independentpieces of infor- mation: whetherwe mean a 0 or I in messageI and whetherwe meana 0 or I in message2.Thus, for a givenrate of sendingsucces- sive symbols,the use of four current valuesallows us to sendtwo independentmessages, each as fast as two current valuesallow us to sendone message.We can sendtwice as many letters per minute by using four current values as we could using two current values. The use of multiplicity of symbolscan lead to difficulties.We have noted that dots and dashessent over a long submarinecable tend to spreadout and overlap.Thus, when we look for one symbol at the far end we see,as Figure II-2 illustrates, a little of several others. Under thesecircumstances, & simple identification, &SI or 0 or else + I or -1, is easierand more certain than a more com- plicatedindentification, as among +3, + l, - l, -3. Further, other matters limit our ability to make complicated distinctions.During magneticstorms, extraneous signals appear on telegraphlines and submarinecables.2 And if we look closely enougli, as we can today with sensitiveelectronic amplifiers,wb seethat minute, undesiredcurrents are always present.These are akin to the erratic Brownian motion of tiny particles observed under a microscopeand to the agitation of air moleculesand of all other matter which we associatewith the idea of heat and temperature.Extraneous currents, which we call noise,are always presentto interferewith the signalssent. Thus,even if we avoid the overlappingof dots and spaceswhich is called intersymbolinterference, noise tends to distort the received signal and to make difficult a distinction among many alternative symbols. Of course,increasing the current transmitted, which means increasingthe power of the transmitted signal, helps to overcomethe effectof noise.There are limits on the power that can be used,however. Drivin g alatge current through a submarine cabletakes alarge voltage, and alargeenough voltage can destroy the insulationof the cable-can in fact causea short circuit. It is likely that the large transmitting voltage used causedthe failure of the first transatlantictelegraph cable in 1858.

z The changing magnetic field of the earth induces currents in the cables. The changes in the earth's magnetic field are presumably caused by streams of charged particles due to solar storms. 30 Symbols,Signals and Noise Even the early telegraphistsunderstood intuitively a gooddeal about the limitations associatedwith speedof signaling,interfer- ence, or noise,the difficulty in distinguishing among many alter- native valuesof current, and the limitation on the power that one could use. More than an intuitive understandingwas required, however.An exactmathematical analysis of suchproblems was needed. Mathematicswas early applied to suchprobleffiS, though their complete elucidationhas come only in recent years.In 1855, William Thomson, later Lord Kelvin, calculated preciselywhat the receivedcurrent will be when a dot or spaceis transmittedover a submarine cable. A more powerful attack on such problems followed the invention of the by in 1875. makes use,not of the slowly sentoflon signalsof telegraphy,but rather of currentswhose strengthvaries smoothly and subtly over a wide range of amplitudes with a rapidity severalhundred times as great as encounteredin manual telegraphy. Many men helpedto establishan adequatemathematical treat- ment of the phenomenaof telephony: Henri Poincar6,the great French mathematician;Oliver Heaviside,an eccentric,English, minor genius;Michael Pupin, of From Immigrant to Inventorfame; and G. A. Campbell,of the American Telephoneand Telegraph Company, ateprominent among these. The mathematical methods which these men used were an extensionof work which the French mathematicianand physicist, JosephFourier, had done early in the nineteenthcentury in connec- tion with the flow of heat.This work had beenapplied to the study of vibration and wasa natural tool for the analysisof the behavior of electric currentswhich changewith time in a complicatedfash- ion-as the electriccurrents of telephonyand telegraphydo. It is impossibleto proceed further on our way without under- standing somethingof Fourier's contribution, a contributionwhich is absolutelyessential to all communication and communication theory. Fortunately,the basicideas are simple;it is their proof and the intricaciesof their application which we shallhave to omit here. Fourier basedhis mathematicalattack on someof the problems of heat flow on a very particular mathematicalfunction calleda The Origins of Information Theory 3l sine wove. Part of a sine wave is shown at the right of Figure Il-4. The height of the wave h varies smoothly up and down as time passes,fluctuating so forever and ever. A sine wave has no begin- ning or end. A sine wave is not just any smoothly wiggling curve. The height of the wave (it may represent the strength of a current or voltage) varies in a particular way with time. We can describe this variation in terms of the motion of a crank connected to a shaft which revolves at a constant speed, as shown at the left of Figure II-4. The height h of the crank above the axle varies exactly sinusoidally with time. A sine wave is a rather simple sort of variation with time. It can be charactertzed, or described, or differentiated completely from any other sine wave by means ofjust three quantities. One of these is the maximum height above zero, called the amplitude. Another is the time at which the maximum is reached, which is specified as the phase. The third is the time Z between maxim a, calIed the period. 
Usually, we use instead of the period the reciprocal of the period, called the frequency, denoted by the letter f. If the period T of a sine wave is 1/100 second, the frequency f is 100 cycles per second, abbreviated cps. A cycle is a complete variation from crest, through trough, and back to crest again. The sine wave is periodic in that one variation from crest through trough to crest again is just like any other.

Fourier succeeded in proving a theorem concerning sine waves which astonished his, at first, incredulous contemporaries. He showed that any variation of a quantity with time can be accurately represented as the sum of a number of sinusoidal variations of

Fig. II-4 (a crank of height h on a revolving shaft, and the sine wave its height traces out in time)

different amplitudes, phases, and frequencies. The quantity concerned might be the displacement of a vibrating string, the height of the surface of a rough ocean, the temperature of an electric iron, or the current or voltage in a telephone or telegraph wire. All are amenable to Fourier's analysis. Figure II-5 illustrates this in a simple case. The height of the periodic curve a above the center line is the sum of the heights of the sinusoidal curves b and c.

Fig. II-5 (panels a, b, c: the periodic curve a is the sum of the sinusoidal curves b and c)

The mere representation of a complicated variation of some physical quantity with time as a sum of a number of simple sinusoidal variations might seem a mere mathematician's trick. Its utility depends on two important physical facts. The circuits used in the transmission of electrical signals do not change with time, and they behave in what is called a linear fashion. Suppose, for instance, we send one signal, which we will call an input signal, over the line and draw a curve showing how the amplitude of the received signal varies with time. Suppose we send a second input signal and draw a curve showing how the corresponding received signal varies with time. Suppose we now send the sum of the two input signals, that is, a signal whose current is at every moment the simple sum of the currents of the two separate input signals. Then, the received output signal will be merely the sum of the two output signals corresponding to the input signals sent separately. We can easily appreciate the fact that communication circuits don't change significantly with time. Linearity means simply that


if we know the output signals corresponding to any number of input signals sent separately, we can calculate the output signal when several of the input signals are sent together merely by adding the output signals corresponding to the input signals. In a linear electrical circuit or transmission system, signals act as if they were present independently of one another; they do not interact. This is, indeed, the very criterion for a circuit being called a linear circuit.

While linearity is a truly astonishing property of nature, it is by no means a rare one. All circuits made up of the resistors, capacitors, and inductors discussed in Chapter I in connection with network theory are linear, and so are telegraph lines and cables. Indeed, usually electrical circuits are linear, except when they include vacuum tubes, or transistors, or diodes, and sometimes even such circuits are substantially linear.

Because telegraph wires are linear, which is just to say because telegraph wires are such that electrical signals on them behave independently without interacting with one another, two telegraph signals can travel in opposite directions on the same wire at the same time without interfering with one another. However, while linearity is a fairly common phenomenon in electrical circuits, it is by no means a universal natural phenomenon. Two trains can't travel in opposite directions on the same track without interference. Presumably they could, though, if all the physical phenomena comprised in trains were linear. The reader might speculate on the unhappy lot of a truly linear race of beings.

With the very surprising property of linearity in mind, let us return to the transmission of signals over electrical circuits. We have noted that the output signal corresponding to most input signals has a different shape or variation with time from the input signal. Figures II-1 and II-2 illustrate this. However, it can be shown mathematically (but not here) that, if we use a sinusoidal signal, such as that of Figure II-4, as an input signal to a linear transmission path, we always get out a sine wave of the same period, or frequency. The amplitude of the output sine wave may be less than that of the input sine wave; we call this attenuation of the sinusoidal signal. The output sine wave may rise to a peak later than the input sine wave; we call this phase shift, or delay of the sinusoidal signal.

The amounts of the attenuation and delay depend on the frequency of the sine wave. In fact, the circuit may fail entirely to transmit sine waves of some frequencies. Thus, corresponding to an input signal made up of several sinusoidal components, there will be an output signal having components of the same frequencies but of different relative phases or delays and of different amplitudes. Thus, in general the shape of the output signal will be different from the shape of the input signal. However, the difference can be thought of as caused by the changes in the relative delays and amplitudes of the various components, differences associated with their different frequencies. If the attenuation and delay of a circuit is the same for all frequencies, the shape of the output wave will be the same as that of the input wave; such a circuit is distortionless.

Because this is a very important matter, I have illustrated it in Figure II-6. In a we have an input signal which can be expressed as the sum of the two sinusoidal components, b and c.
In trans- mission,b is neitherattenuated nor delayed,so the output b' of the samefrequency as b is the sameas b. However, the output c' due to the input c is attenuatedand delayed.The total output a', the sum of b' and c', clearly has a different shapefrom the input a. Yet, the output is made up of two componentshaving the same frequenciesthat are presentin the input. The frequencycompo- nents merely have different relative phasesor delaysand different relative amplitudesin the output than in the input. The Fourier analysisof signalsinto componentsof variousfre- quenciesmakes it possibleto study the transmissionproperties of a linear circuit for all signalsin terms of the attenuationand delay it imposes on sine waves of various frequenciesas they pass through it. Fourier analysisis a powerful tool for the analysisof transmis- sion problems.It provided mathematiciansand engineerswith a bewildering variety of resultswhich they did not at first clearly understand.Thus, early telegraphistsinvented all sortsof shapes and combinationsof signalswhich were allegedto havedesirable properties, but they were often inept in their mathematicsand wrong in their arguments.There was much disputeconcerning the efficacyof various signalsin ameliorating the limitations imposed The Origins of Information Theory 35 by circuit speed,intersymbol interference,noise, and limitations on transmittedpower. In l9l7 , cameto the American Telephoneand TelegraphCompany immediatelyafter receivirg his Ph.D. at Yale (Ph.D.'s were considerablyrarer in those days). Nyquist was a much better mathematicianthan most men who tackled the prob- lems of telegraphy, and he always was a clear, original, and philosophicalthinker concerningcommunication. He tackledthe problems of telegraphywith powerful methods and with clear insight.In 1924,he publishedhis resultsin an important paper, "Certain FactorsAffectirg TelegraphSpeed."
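Before taking up that paper, it may help to put the linearity property described above into numbers. The sketch below is a small Python illustration, not anything from the text: the amplitudes, frequencies, and the attenuation and delay applied to the second component are made-up values chosen only to mimic the situation of Figure II-6. Each sinusoidal component of the input passes through the imagined linear circuit separately, and the output is simply their sum.

import math

def sine(amplitude, frequency, phase, t):
    # height of one sinusoidal component at time t
    return amplitude * math.sin(2 * math.pi * frequency * t - phase)

# Input a is the sum of two components b and c (made-up values).
def b(t): return sine(1.0, 1.0, 0.0, t)    # low-frequency component
def c(t): return sine(0.5, 3.0, 0.0, t)    # high-frequency component
def a(t): return b(t) + c(t)

# An imagined linear circuit: it passes b unchanged but attenuates c to 60 per cent
# of its amplitude and delays it by a tenth of its period (again, made-up numbers).
def b_out(t): return sine(1.0, 1.0, 0.0, t)
def c_out(t): return sine(0.5 * 0.6, 3.0, 2 * math.pi * 0.1, t)
def a_out(t): return b_out(t) + c_out(t)   # linearity: the outputs simply add

for t in [0.0, 0.1, 0.2, 0.3]:
    print(round(a(t), 3), round(a_out(t), 3))
# The output contains only the frequencies present in the input,
# but its shape differs because one component was weakened and delayed.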

Fig. II-6 (panels a, b, c: the input signal and its two sinusoidal components; panels b', c', a': the corresponding outputs and their sum)

This paper deals with a number of problems of telegraphy. Among other things, it clarifies the relation between the speed of telegraphy and the number of current values such as +1, -1 (two current values) or +3, +1, -1, -3 (four current values). Nyquist says that if we send symbols (successive current values) at a constant rate, the speed of transmission, W, is related to m, the number of different symbols or current values available, by

W = K log m

Here K is a constant whose value depends on how many successive current values are sent each second. The quantity log m means the logarithm of m. There are different bases for taking logarithms. If we choose 2 as a base, then the values of log m for various values of m are given in Table II.

TABLE II

m      log m
1      0
2      1
3      1.6
4      2
8      3
16     4
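As a quick check of Table II, and of what the relation W = K log m implies about relative speeds, a few lines of Python suffice. This is only an illustration of the arithmetic; it is not drawn from Nyquist's paper.

import math

# log m to the base 2 for the values of m in Table II
for m in [1, 2, 3, 4, 8, 16]:
    print(m, round(math.log2(m), 1))    # 0.0, 1.0, 1.6, 2.0, 3.0, 4.0

# With W = K log m, speeds relative to off-on (two-current) telegraphy:
for m in [3, 4, 8]:
    print(m, "current values:", round(math.log2(m), 2), "times the speed of two")
# three currents give about 1.6 (a gain of roughly 60 per cent),
# four give 2, and eight give 3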

To sum up the matter by means of an equation, log x is such a number that

2^(log x) = x

We may see by taking the logarithm of each side that the following relation must be true:

log 2^(log x) = log x

If we write M in place of log x we see that

log 2^M = M

All of this is consistent with Table II. We can easily see by means of an example why the logarithm is the appropriate function in Nyquist's relation. Suppose that we wish to specify two independent choices of off-or-on, 0-or-1, simultaneously. There are four possible combinations of two independent 0-or-1 choices, as shown in Table III.

TABLE III

Number of Combination    First 0-or-1 Choice    Second 0-or-1 Choice
1                        0                      0
2                        0                      1
3                        1                      0
4                        1                      1

Further, if we wish to specify three independent choices of 0-or-1 at the same time, we find eight combinations, as shown in Table IV.

TABLE IV

Number of Combination    First 0-or-1 Choice    Second 0-or-1 Choice    Third 0-or-1 Choice
1                        0                      0                       0
2                        0                      0                       1
3                        0                      1                       0
4                        0                      1                       1
5                        1                      0                       0
6                        1                      0                       1
7                        1                      1                       0
8                        1                      1                       1

Similarly, if we wish to specify four independent 0-or-1 choices, we find sixteen different combinations, and, if we wish to specify M different independent 0-or-1 choices, we find 2^M different combinations.

If we can specify M independent 0-or-1 combinations at once, we can in effect send M independent messages at once, so surely the speed should be proportional to M. But, in sending M messages at once we have 2^M possible combinations of the M independent 0-or-1 choices. Thus, to send M messages at once, we need to be able to send 2^M different symbols or current values. Suppose that we can choose among 2^M different symbols. Nyquist tells us that

we should take the logarithm of the number of symbols in order to get the line speed, and

log 2^M = M

Thus, the logarithm of the number of symbols is just the number of independent 0-or-1 choices that can be represented simultaneously, the number of independent messages we can send at once, so to speak.

Nyquist's relation says that by going from off-on telegraphy to three-current (+1, 0, -1) telegraphy we can increase the speed of sending letters or other symbols by 60 per cent, and if we use four current values (+3, +1, -1, -3) we can double the speed. This is, of course, just what Edison did with his quadruplex telegraph, for he sent two messages instead of one. Further, Nyquist showed that the use of eight current values (0, 1, 2, 3, 4, 5, 6, 7, or +7, +5, +3, +1, -1, -3, -5, -7) should enable us to send three times as fast as with two current values. However, he clearly realized that fluctuations in the attenuation of the circuit, interference or noise, and limitations on the power which can be used make the use of many current values difficult.

Turning to the rate at which signal elements can be sent, Nyquist defined the line speed as one half of the number of signal elements (dots, spaces, current values) which can be transmitted in a second. We will find this definition particularly appropriate for reasons which Nyquist did not give in this early paper.

By the time that Nyquist wrote, it was common practice to send telegraph and telephone signals on the same wires. Telephony makes use of frequencies above 150 cps, while telegraphy can be carried out by means of lower frequency signals. Nyquist showed how telegraph signals could be so shaped as to have no sinusoidal components of high enough frequency to be heard as interference by telephones connected to the same line. He noted that the line speed, and hence also the speed of transmission, was proportional to the width or extent of the range or band (in the sense of strip) of frequencies used in telegraphy; we now call this range of frequencies the band width of a circuit or of a signal.

Finally, in analyzing one proposed sort of telegraph signal,

Nyquist showed that it contained at all times a steady sinusoidal component of constant amplitude. While this component formed a part of the transmitter power used, it was useless at the receiver, for its eternal, regular fluctuations were perfectly predictable and could have been supplied at the receiver rather than transmitted thence over the circuit. Nyquist referred to this useless component of the signal, which, he said, conveyed no intelligence, as redundant, a word which we will encounter later.

Nyquist continued to study the problems of telegraphy, and in 1928 he published a second important paper, "Certain Topics in Telegraph Transmission Theory." In this he demonstrated a number of very important points. He showed that if one sends some number 2N of different current values per second, all the sinusoidal components of the signal with frequencies greater than N are redundant, in the sense that they are not needed in deducing from the received signal the succession of current values which were sent. If all of these higher frequencies were removed, one could still deduce by studying the signal which current values had been transmitted. Further, he showed how a signal could be constructed which would contain no frequencies above N cps and from which it would be very easy to deduce at the receiving point what current values had been sent. This second paper was more quantitative and exact than the first; together, they embrace much important material that is now embodied in communication theory.

R. V. L. Hartley, the inventor of the Hartley oscillator, was thinking philosophically about the transmission of information at about this time, and he summarized his reflections in a paper, "Transmission of Information," which he published in 1928. Hartley had an interesting way of formulating the problem of communication, one of those ways of putting things which may seem obvious when stated but which can wait years for the insight that enables someone to make the statement. He regarded the sender of a message as equipped with a set of symbols (the letters of the alphabet for instance) from which he mentally selects symbol after symbol, thus generating a sequence of symbols. He observed that a chance event, such as the rolling of balls into pockets, might equally well generate such a sequence. He then defined H, the

information of the message, as the logarithm of the number of possible sequences of symbols which might have been selected and showed that

H = n log s

Here n is the number of symbols selected, and s is the number of different symbols in the set from which symbols are selected.

This is acceptable in the light of our present knowledge of information theory only if successive symbols are chosen independently and if any of the s symbols is equally likely to be selected. In this case, we need merely note, as before, that the logarithm of s, the number of symbols, is the number of independent 0-or-1 choices that can be represented or sent simultaneously, and it is reasonable that the rate of transmission of information should be the rate of sending symbols per second, n, times the number of independent 0-or-1 choices that can be conveyed per symbol.

Hartley goes on to the problem of encoding the primary symbols (letters of the alphabet, for instance) in terms of secondary symbols (e.g., the sequences of dots, spaces, and dashes of the Morse code). He observes that restrictions on the selection of symbols (the fact that E is selected more often than Z) should govern the lengths of the secondary symbols (Morse code representations) if we are to transmit messages most swiftly. As we have seen, Morse himself understood this, but Hartley stated the matter in a way which encouraged mathematical attack and inspired further work. Hartley also suggested a way of applying such considerations to continuous signals, such as telephone signals or picture signals.

Finally, Hartley stated, in accord with Nyquist, that the amount of information which can be transmitted is proportional to the band width times the time of transmission. But this makes us wonder about the number of allowable current values, which is also important to speed of transmission. How are we to enumerate them?

After the work of Nyquist and Hartley, communication theory appears to have taken a prolonged and comfortable rest. Workers busily built and studied particular communication systems. The art grew very complicated indeed during World War II. Much new understanding of particular new communication systems and devices was achieved, but no broad philosophical principles were laid down.

During the war it became important to predict from inaccurate or "noisy" radar data the courses of airplanes, so that the planes could be shot down. This raised an important question: Suppose that one has a varying electric current which represents data concerning the present position of an airplane but that there is added to it a second meaningless erratic current, that is, a noise. It may be that the frequencies most strongly present in the signal are different from the frequencies most strongly present in the noise. If this is so, it would seem desirable to pass the signal with the noise added through an electrical circuit or filter which attenuates the frequencies strongly present in the noise but does not attenuate very much the frequencies strongly present in the signal. Then, the resulting electric current can be passed through other circuits in an effort to estimate or predict what the value of the original signal, without noise, will be a few seconds from the present. But what sort of combination of electrical circuits will enable one best to predict from the present noisy signal the value of the true signal a few seconds in the future?
In essence,the problem is one in which we deal with not one but with a whole ensembleof possiblesignals (courses of the plane), so that we do not know in advancewhich signal we are dealing with. Further, we are troubled with an unpredictablenoise. This problem was solvedin Russiaby A. N. Kolmogoroff. In this country it was solvedindependently by . Wiener is a mathematicianwhose background ideally fitted him to deal with this sort of problem, and during the war he produced a yellow-bound document, affectionately called "the yellow peril'o (becauseof the headachesit caused),in which he solvedthe diffi- cult problem. During and after the war another mathematici&n, Claude E. Shannrn,interested himself in the generalproblem of communica- tion. Shannon began by considering the relative advantagesof many new and fanciful communication systems,and he sought some basic method of comparing their merits. In the same year (1948)that Wiener publishedhis book, Cybernetics,which deals with communication and control, Shannon published in two parts I

42 Symbols,Signals and Noise a paper which is regarded as the foundation of modern communi- cation theory. Wiener and Shannon alike consider,not the problem of a single signal, but the problem of dealing adequately with any signal selectedfrom a group or ensembleof possiblesignals. There was a free interchange among various workers before the publication of either Wiener's book or Shannon'spaper, and similar ideasand expressions appear in both, although Shannon's inteqpretation appearsto be unique. Chiefly, Wiener's name has come to be associatedwith the field of extracting signals of a given ensemblefrom noise of a known type. An example of this has been given above.The enemypilot follows a course which he choses,and our radar adds noise of natural origin to the signalswhich representthe position of the plane. We have a set of possiblesignals (possible courses of the airplane), not of our own choosin5,mixed with noise,not of our own choosing,and we try to make the bestestimate of the present or future value of the signal (the present or future position of the airplane) despite the noise. Shannon'sname has come to be associatedwith mattersof so encodirg messageschosen from a known ensemblethat they can be transmitted accurately and swiftly in the presenceof noise.As an example, we may have as a messagesource English text, not of our own choositrB,and an electrical circuit, s€rl,a noisy telegraph cable, not of our own choosing. But in the problem treated by Shannoil, we are allowed to choose how we shall representthe messageas an electrical signal-how many current valueswe shall allow, for instance,and how many we shall transmit per second. The problem, then, is not how to treat a signal plus noise so as to get a best estimateof the signal, but what sort of signal to send so as best to conveymessages of a given type over a particularsort of noisy circuit. This matter of efficient encodirg and its consequencesform the chief substanceof information theory. In that an ensembleof messagesis considered,the work reflectsthe spirit of the work of Kolmogoroff and Wiener and of the work of Morse and Hartley as well. It would be uselessto review here the content of Shannon's The Origins of Info*ation Theory 43 work, for that is what this book is about. We shall see,however, that it shedsfurther light on all the problems raised by Nyquist and Hartley and goesfar beyond thoseproblems. In looking back on the origins of communication theory, two other namesshould perhaps be mentioned.In 1946, published an ingeniouspaper, "Theory of Communication." This, suggestiveas it is, missedthe inclusion of noise, which is at the heart of modern communicationtheory. Further, in 1949,'W.G. Tuller published an interestingpaper, "Theoretical Limits on the Rate of Transmissionof Informatiofr," which in part parallels Shannon'swork. The gist of this chapter has been that the very generaltheory of communicationwhich Shannonhas given us grew out of the study of particular problems of electrical communication. Morse was faced with the problem of representingthe letters of the alphabet by short or long pulsesof current with intervening spacesof no current-that is, by the dots, dashes,and spacesof telegraphy.He wisely choseto representcommon letters by short combinations of dots and dashesand uncommon letters by long combinations; this was a first stepin efficientencoding of messages,a vital part of communicationtheory. 
Ingeniousinventors who followed Morse made use of different intensitiesand directionsof current flow in order to give the sender a greater choice of signals than merely off-or-on. This made it possibleto sendmore letters per unit time, but it made the signal more susceptibleto disturbance by unwanted electrical disturb- ances called noise as well as by inability of circuits to transmit accuratelyrapid changesof current. An evaluationof the relative advantagesof many different sorts of telegraphsignals was desirable. Mathematical tools wereneeded for such a study. One of the most important of theseis Fourier analysis,which makesit possibleto representany signal as a sum of sinewaves of variousfrequencies. Most communicationcircuits are linear. This meansthat several signalspresent in the circuit do not interact or interfere.It can be shown that while even linear circuits change the shape of most signals, the effect of a linear circuit on a sine wave is merely to make it weaker and to delay its time of arrival. Hence, when a M Symbols,Signals and Noise complicatedsignal is representedas a sum of sinewaves of various frequencies,it is easyto calculate the effect of a linear circuit on each sinusoidal component separately and then to add up the weakenedor attenuatedsinusoidal components in order to obtain the over-all receivedsignal. Nyquist showed that the number of distinct, different current valueswhich can be sentover a circuit per secondis twice the total range or band width of frequenciesused. Thus, the rate at which letters of text can be transmitted is proportional to band width. Nyquist and Hartley also showed that the rate at which lettersof text can be transmitted is proportional to the logarithm of the number of current valuesused. A complete theory of communication required other mathe- matical tools and new ideas.These arerelated to work done by Kolmogoroff and Wiener, who considered the problem of an unknown signalof a given type disturbed by the addition of noise. How doesone bestestimate what the signalis despitethe presence of the interferirg noise? Kolmogoroff and Wiener solved this problem. The problem Shannonset himself is somewhatdifferent. Suppose we have a messagesource which producesmessages of a giventype, such as English text. Suppose we have a noisy of specified characteristics.How can we representor encodemessages from the messagesource by meansof electrical signals so as to attain the fastest possibletransmission over the noisy channel?Indeed, how fast can we transmit a given type of messageover a given channelwithout error? In a rough and general wa/, this is the problem that Shannonset himself and solved. CHAPTERIII A Mathematical Model

A uernEMArICALrHEoRy which seeksto explain and to predict the events in the world about us always deals with a simplified model of the world, a mathematicalmodel in which only things pertinent to the behavior under considerationenter. Thus, planetsare composedof various substances,solid, liquid, and gaseous,at variouspressures and temperatures.The parts of their substancesexposed to the rays of the sun reflect various fractions of the different colors of the light which falls upon them, so that when we observeplanets we seeon them various colored features.However, the mathematicalastronomer in predictingthe orbit of a planetabout the sun needtake into accountonly the total massof the sun, the distanceof the planet from the sun, and the speedand direction of the planet's motion at some initial instant. For a more refinedcalculation, the astronomermust also take into account the total massof the planet and the motions and masses of other planetswhich exert gravitationalforces on it. This does not mean that astronomersare not concernedwith other aspectsof planets,and of stars and nebulae as well. The important point is that they neednot take theseother mattersinto considerationin planetary orbits. The great beautyand power of a mathematical theory or model lies in the separationof the relevantfrom the irrelevant.so that certain observablebehavior

45 46 Symbols, Signals and Noise can be relatedand understoodwithout the needof comprehending the whole nature and behavior of the universe. Mathematical models can have various degreesof accuracyor applicability. Thus, we can accuratelypredict the orbits of planets by regarding them as rigid bodies, despitethe fact that no truly rigid body exists.On the other hand, the long-term motions of our moon can only be understoodby taking into account the motion of the waters over the face of the earth, that is, the tides.Thus, in dealing very precisely with lunar motion we cannot regard the earth as a rigid body. In a similar wa/, in network theory we study the electrical properties of interconnectionsof ideal inductors,capacitors, and resistors,which are assignedcertain simple mathematicalproper- ties. The componentsof which the actual useful circuits in radio, TV, and telephone equipment are made only approximate the properties of the ideal inductors, capacitors,and resistorsof net- work theory. Sometimes,the differenceis trivial and can be disre- garded. Sometimesit must be taken into accountby more refined calculations. Of course,a mathematical model may be a very crude or even an invalid representationof events in the real world. Thus, the self-interested,gain-motivated "economic man" of early economic theory has fallen into disfavor becausethe behavior of the eco- nomic man doesnot appearto correspondto or to usefully explain the actual behavior of our economic world and of the peoplein it. In the orbits of the planets and the behavior of networks,we have examplesof rdeahzeddeterminislic systems which have the sort of predictable behavior we ordinarily expect of machines. Astronomers can compute the positions which the planets will occupy millennia in the future. Network theory tells us all the subsequentbehavior of an electrical network when it is excitedby a particular electrical signal. Even the individual economic man is deterministic, for he will always act for his economic gain. But, if he at some time gambles on the honest throw of a die becausethe odds favor him, his economic fate becomesto a degreeunpredictable, for he may lose even though the odds do favor him. We can,however, make a mathematical model for purely chance A Mathematical Model 47 events,such as the drawing of some number, say three, of white or black balls from a container holding equal numbers of white and black balls.This model tells us, in fact, that after many trials we will have drawn all white about r/sof the time, two whites and a black about3/aof the time, two blacks and a white about3/sof the time, and all black abovt r/sof the time. It can also tell us how much of a deviationfrom theseproportions we may reasonably expect after a given number of trials. Our experienceindicates that the behavior of actual human beings is neither as determined as that of the economic man nor as simply random as the throw of a die or as the drawing of balls from a mixture of black and white balls. It is clear, however,that a deterministic model will not get us far in the consideration of human behavior, such as human communication, while a random or statisticalmodel might. Wb all know that the actuarial tablesused by insurancecom- paniesmake falr predictionsof the fraction of alarge group ofmen in a given agegroup who will die in one year, despite the fact that we cannot predict when a particular man will die. 
Thus a statistical model may enableus to understand and even to make somesort of predictions concerninghuman behavior, even as we can predict how often, on the average,we will draw three black balls by chance tiom an equal mixture of white and black balls. It might be objectedthat actuarial tables make predictions con- cerning groups of people,not predictions concerning individuals. However, experienceteaches us that we can make predictions concerning the behavior of individual human beings as well as of groups of individuals.For instance,in countirg the frequencyof usageof the letter E in all English prose we will find that E con- stitutesabout 0.13of all the lettersappearing, while W, for instance, constitutesonly about 0.02 of all letters appearing. But, we also find almost the same proportions of E's and W's in the prose written by any one person. Thus, we can predict with someconfi- dencethat if you, or I, or Joe Doakes,or anyone elsewrites a long letter, or an articleoor a book, about 0.13of the letters he useswill be E's. This predictability of behavior limits our freedom no more than doesany other habit. We don't have to use in our writing the same 48 Symbols,Signals and Noise fraction of E's, or of any other letter, that everyoneelse does. In fact, severaluntrammeled individuals have broken away from the common pattern. William F. Friedman, the eminent cryptanalyst and author of The ShakesperianCipher Examined, has supplied me with the followitg examples. Gottlob Burmann,a German poet who lived from 1737to 1805, wrote 130poems, including a total of 20,000words, without once using the letter R. Further, during the last seventeenyears of his life, Burmann even omitted the letter from his daily conversation. In each of five storiespublished by Alonso Alcala y Herrerain Lisbon in I 641a different vowel was suppressed.Francisco Navar- rete y Ribera (1659),Fernando Jacinto de Zurrta y Haro (1654), and Manuel Lorenzo de Lrzarazv y Berbvizana (1654)provided other examples. In 1939, Ernest Vincent Wright published a 267-pagenovel, Gadsby,in which no useis made of the letter E. I quote aparugraph below:

Upon this basisI am goingto showyou how a bunchof brightyoung folks did find a champion; a man with boys and girls of his own; a man of so dominating and huppy individuality that Youth is drawn to him as is a fly to a sugar bowl. It is a story about a small town. It is not a gossipy yarn; nor is it a dry, monotonous account, full of such customary "fiIl-ins'o o'romantic as moonlight casting murky shadows down a long, winding country road." Nor will it say anything about tinklings lulling distant folds; robins carolling at twilight, nor any'owarm glow of lamplight" from a cabin window. No. It is an account of up-and-doing activity; a vivid portrayal of Youth as it is today; and a practical discarding of that worn- out notion that "a child don't know anything."

While such exercisesof free will show that it is not impossible to break the chainsof habit, we ordinarily write in a more conven- tional manner. When we are not going out of our way to demon- strate that we can do otherwise, we customarily use our due fraction of 0.13E's with almost the consistencyof a machineor a mathematical rule. We cannot argue from this to the converseidea that a machine into which the same habits were built could write English text. However, Shannonhas demonstrated how English words and text A MathematicalModel 49 can be approximatedby u mathematicalprocess which could be carried out by u machine. Suppose,for instance,that we merely produce a sequenceof lettersand spaceswith equal probabilities.We might do this by putting equal numbersof cards marked with eachletter and with th. spaceinto a hat, mixing them up, drawinE a card, recordirg its symbol, returnitg it, remixing, drawing another card, and so on. This giveswhat Shannoncalls the zero-orderapproximation to English text. His example,obtained by an equivalentprocess, goes: l. Zero-order approximation (symbols independentand equi- probable)

XFOML RXKHRJFFJUJ ZLPVVCFWKCYI FFJEYVKCQSGI{YD QPAAMKBZAACTBZLHJQD.

Here there are far too many Z's and W's, and not nearly enough E's and spaces. We can approach more nearly to English text by choosing letters independently of one another, but choosing E more often than W or Z. We could do this by putting many E's and few W's and Z's into the hat, mixing, and drawing out the letters. As the probability that a given letter is an E should be .13, out of every hundred letters we put into the hat, 13 should be E's. As the probability that a letter will be W should be .02, out of each hundred letters we put into the hat, 2 should be W's, and so on. Here is the result of an equivalent procedure, which gives what Shannon calls a first-order approximation of English text:

2. First-order approximation (symbols independent but with frequencies of English text).

OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL
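The drawing-from-a-hat procedure is easy to imitate by machine. The sketch below, in Python, does so for a handful of letters. Only the figures for E (.13) and W (.02) come from the text; the other frequencies, and the weight given to the space, are round illustrative numbers.

import random

# Illustrative letter frequencies; random.choices rescales the weights,
# so they need not sum to one.
frequencies = {
    "E": 0.13, "T": 0.09, "A": 0.08, "O": 0.07, "N": 0.07,
    "I": 0.07, "S": 0.06, "W": 0.02, "Z": 0.001, " ": 0.18,
}
letters = list(frequencies)
weights = [frequencies[ch] for ch in letters]

# First-order approximation: each character is drawn independently,
# common letters often, rare letters seldom.
sample = "".join(random.choices(letters, weights=weights, k=60))
print(sample)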

In English text we almost never encounter any pair of letters beginnitg with Q exceptQU. The probability of encounterirg QX ot QZ is essentiallyzero. White the probability of QU is not 0, it is so small as not to be listed in the tables I consulted.On the other hand, the probability of TH is .037,the probability of OR is .010 and the probability of WE is .006. These probabilities have the followitg meaning. In a stretch of text containing, say, 10,001 50 Symbols,Signals and Noise letters,there are 10,000successive pairs of letters,i.e., the first and second,the secondand third, and so on to the next to last and the last. Of the pairs a certainnumber are the letters TH. This might be 370 pairs. If we divide the total number of times we find TH, which we have assumedto be 370 times, by the total number of pairs of letters,which we have assumedto be 10,000,we getthe probability that a randomly selectedpair of letters in the text will be TH, that is, 370/ 10,000,or .03'l. Diligent cryptanalysts have made tables of such digramprob- abilities for English text. To seehow we might use thesein con- structirg sequencesof letters with the same digram probabilities as English text, let us assumethat we use 27 hats, 26 for digrams beginning with each of the letters and one for digrams beginning with a space.We will then put a large number of digrarnsinto the hats accordirg to the probabilities of the digrams- Out of 1,000 digramswe would put tn 37 TH's, l0 WE's, and so on. Let us consider fbr a moment the meaning of thesehats futl of digrams in terms of the original counts which led to the evaluations of digram probabilities. In going through the text letter by letter we will encounterevery T in the text. Thus, the number of digrams beginningwith T, all of which we put in one hat, will be the sameas the number of T's. The fraction theserepresent of the total number of digramscounted is the probability of encounteringT in the text; that is, .10. We might call this probabilityp(T)

p(T) = .10

We may note that this is also the fraction of digrams, distributed among the hats, which end in T as well as the fraction that begin with T. Again, basing our total numbers on 1,001 letters of text, or 1,000 digrams, the number of times the digram TH is encountered is 37, and so the probability of encountering the digram TH, which we might call p(T, H), is

p(T, H) = .037

Now we see that 0.10, or 100, of the digrams will begin with T and hence will be in the T hat, and of these 37 will be TH. Thus, the fraction of the T digrams which are TH will be 37/100, or 0.37. Correspondingly, we say that the probability that a digram beginning with T is TH, which we might call p_T(H), is

p_T(H) = .37

This is called the conditional probability that the letter following a T will be an H.

One can use these probabilities, which are adequately represented by the numbers of various digrams in the various hats, in the construction of text which has both the same letter frequencies and digram frequencies as does English text. To do this one draws the first digram at random from any hat and writes down its letters. He then draws a second digram from the hat indicated by the second letter of the first digram and writes down the second letter of this second digram. Then he draws a third digram from the hat indicated by the second letter of the second digram and writes down the second letter of this third digram, and so on. The space is treated just like a letter. There is a particular probability that a space will follow a particular letter (ending a "word") and a particular probability that a particular letter will follow a space (starting a new "word").

By an equivalent process, Shannon constructed what he calls a second-order approximation to English; it is:

3. Second-order approximation (digram structure as in English).

ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE
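The hat-drawing procedure described above is easy to mechanize. What follows is a minimal sketch in Python, assuming we count digrams in some sample text and then repeatedly draw the next letter from the hat belonging to the letter just written; the sample string and the function names are merely illustrative, and a real experiment would use a much longer body of text.

```python
import random
from collections import defaultdict

def digram_counts(text):
    """Count how often each letter follows each other letter (space included)."""
    counts = defaultdict(lambda: defaultdict(int))
    for first, second in zip(text, text[1:]):
        counts[first][second] += 1
    return counts

def second_order_text(text, length=60):
    """Generate text with roughly the same digram statistics as the source text."""
    counts = digram_counts(text.upper())
    letter = random.choice(list(counts))            # draw the first digram from any hat
    output = [letter]
    for _ in range(length):
        followers = counts[letter]
        if not followers:                           # dead end: draw from any hat again
            letter = random.choice(list(counts))
        else:
            choices = list(followers)
            weights = [followers[c] for c in choices]   # counts stand in for probabilities
            letter = random.choices(choices, weights=weights)[0]
        output.append(letter)
    return "".join(output)

sample = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG WHILE THE CAT SLEEPS"
print(second_order_text(sample))
```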

Cryptanalysts have even produced tables giving the probabilities of groups of three letters, called trigram probabilities. These can be used to construct what Shannon calls a third-order approximation to English. His example goes:

4. Third-order approximation (trigram structure as in English).

IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE

When we examine Shannon's examples 1 through 4 we see an increasing resemblance to English text. Example 1, the zero-order approximation, has no wordlike combinations. In example 2, which takes letter frequencies into account, OCRO and NAH somewhat resemble English words. In example 3, which takes digram frequencies into account, all the "words" are pronounceable, and ON, ARE, BE, AT, and ANDY occur in English. In example 4, which takes trigram frequencies into account, we have eight English words and many English-sounding words, such as CROCID, PONDENOME, and DEMONSTURES.

G. T. Guilbaud has carried out a similar process using the statistics of Latin and has so produced a third-order approximation (one taking into account trigram frequencies) resembling Latin, which I quote below:

IBUS CENT IPITIA VETIS IPSE CUM VIVIVS SE ACETITI DEDENTUR

The underlined words are genuine Latin words. It is clear from such examples that by giving a machine certain statistics of a language, the probabilities of finding a particular letter or group of 1, or 2, or 3, or n letters, and by giving the machine an ability equivalent to picking a ball from a hat, flipping a coin, or choosing a random number, we could make the machine produce a close approximation to English text or to text in some other language. The more we gave the machine, the more closely would its product resemble English or other text, both in its statistical structure and to the human eye. If we allow the machine to choose groups of three letters on the basis of their probability, then any three-letter combination which it produces must be an English word or a part of an English word, and any two-letter "word" must be an English word. The machine is, however, less inhibited than a person, who ordinarily writes down only sequences of letters which do spell words. Thus, he misses ever writing down pompous PONDENOME, suspect ILONASIVE, somewhat vulgar CROCID, learned DEMONSTURES, and wacky but delightful DEAMY. Of course, a man in principle could write down such combinations of letters, but ordinarily he doesn't.

We could cure the machine of this ability to produce un-English words by making it choose among groups of letters as long as the longest English word. But, it would be much simpler merely to supply the machine with words rather than letters and to let it produce these words according to certain probabilities.

Shannon has given an example in which words were selected independently, but with the probabilities of their occurring in English text, so that the, and, man, etc., occur in the same proportion as in English. This could be achieved by cutting text into words, scrambling the words in a hat, and then drawing out a succession of words. He calls this a first-order word approximation. It runs as follows:

5. First-order word approximation. Here words are chosen independently but with their appropriate frequencies.

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE

There are no tables which give the probability of different pairs of words. However, Shannon constructed a random passage in which the probabilities of pairs of words were the same as in English text by the following expedient. He chose a first pair of words at random in a novel. He then looked through the novel for the next occurrence of the second word of the first pair and added the word which followed it in this new occurrence, and so on. This process gave him the following second-order word approximation to English.

6. Second-order word approximation. The word transition probabilities are correct, but no further structure is included.

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

We see that there are stretches of several words in this final passage which resemble and, indeed, might occur in English text.

Let us consider what we have found. In actual English text, in that text which we send by teletypewriter, for instance, particular letters occur with very nearly constant frequencies. Pairs of letters and triplets and quadruplets of letters occur with almost constant frequencies over long stretches of the text. Words and pairs of words occur with almost constant frequencies. Further, we can by means of a random mathematical process, carried out by a machine if you like, produce sequences of English words or letters exhibiting these same statistics.

Such a scheme, even if refined greatly, would not, however, produce all sequences of words that a person might utter. Carried to an extreme, it would be confined to combinations of words which had occurred; otherwise, there would be no statistical data available on them. Yet I may say, "The magenta typhoon whirled the farded bishop away," and this may well never have been said before.

The real rules of English text deal not with letters or words alone but with classes of words and their rules of association, that is, with grammar. Linguists and engineers who try to make machines for translating one language into another must find these rules, so that their machines can combine words to form grammatical utterances even when these exact combinations have not occurred before (and also so that the meaning of words in the text to be translated can be deduced from the context). This is a big problem. It is easy, however, to describe a "machine" which randomly produces endless, grammatical utterances of a limited sort.

Figure III-1 is a diagram of such a "machine." Each numbered box represents a state of the machine. Because there is only a finite number of boxes or states, this is called a finite-state machine. From each box a number of arrows go to other boxes. In this particular machine, only two arrows go from each box to each of two other boxes. Also, in this case, each arrow is labeled 1/2. This indicates that the probability of the machine passing from, for instance, state 2 to state 3 is 1/2 and the probability of the machine passing from state 2 to state 4 is 1/2.

To make the machine run, we need a sequence of random choices, which we can obtain by flipping a coin repeatedly. We can let heads (H) mean follow the top arrow and tails (T) follow the bottom arrow. This will tell us to pass to a new state. When we do this we print out the word, words, or symbol written in that state box and flip again to get a new state.

Fig. III-1

As an example, if we started in state 1 and flipped the following sequence of heads and tails: T H H H T T H T T T H H H H, the machine would print out

THE COMMUNIST PARTY INVESTIGATED THE CONGRESS. THE COMMUNIST PARTY PURGED THE CONGRESS AND DESTROYED THE COMMUNIST PARTY AND FOUND EVIDENCE OF THE CONGRESS.

This can go on and on, never retracing its whole course and producing "sentences" of unlimited length.

Random choice according to a table of probabilities of sequences of symbols (letters and space) or words can produce material resembling English text. A finite-state machine with a random choice among allowed transitions from state to state can produce material resembling English text. Either process is called a stochastic process, because of the random element involved in it.

We have examined a number of properties of English text. We have seen that the average frequency of E's is commonly constant for both the English text produced by one writer and, also, for the text produced by all writers. Other more complicated statistics, such as the frequency of digrams (TH, WE, and other letter pairs), are also essentially constant. Further, we have shown that English-like text can be produced by a sequence of random choices, such as drawings of slips of paper from hats, or flips of a coin, if the proper probabilities are in some way built into the process. One way of producing such text is through the use of a finite-state machine, such as that of Figure III-1.

We have been seeking a mathematical model of a source of English text. Such a model should be capable of producing text which corresponds closely to actual English text, closely enough so that the problem of encoding and transmitting such text is essentially equivalent to the problem of encoding and transmitting actual English text. The mathematical properties of the model must be mathematically defined so that useful theorems can be proved concerning the encoding and transmission of the text it produces, theorems which are applicable to a high degree of approximation to the encoding of actual English text. It would, however, be asking too much to insist that the production of actual English text conform with mathematical exactitude to the operation of the model.

The mathematical model which Shannon adopted to represent the production of text (and of spoken and visual messages as well) is the ergodic source. To understand what an ergodic source is, we must first understand what a stationary source is, and to explain this is our next order of business.

The general idea of a stationary source is well conveyed by the name. Imagine, for instance, a process, i.e., an imaginary machine, that produces forever after it is started the sequence of characters A E A E A E A E A E, etc. Clearly, what comes later is like what has gone before, and stationary seems an apt designation of such a source of characters.

We might contrast this with a source of characters which, after starting, produced A E A A E E A A A E E E, etc. Here the strings of A's and E's get longer and longer without end; certainly this is not a stationary source.

Similarly, a sequence of characters chosen at random with some assigned probabilities (the first-order letter approximation of example 2 above) constitutes a stationary source, and so do the digram and trigram sources of examples 3 and 4. The general idea of a stationary source is clear enough. An adequate mathematical definition is a little more difficult.

The idea of stationarity of a source demands no change with time. Yet, consider a digram source, in which the probability of the second character depends on what the previous character is.
If we start such a source out on the letter A, several different letters can follow, while if we start such a source out on the letter Q, the second letter must be U. In general, the manner of starting the source will influence the statistics of the sequence of characters produced, at least for some distance from the start.

To get around this, the mathematician says, let us not consider just one sequence of characters produced by the source. After all, our source is an imaginary machine, and we can quite well imagine that it has been started an infinite number of times, so as to produce an infinite number of sequences of characters. Such an infinite number of sequences is called an ensemble of sequences.

These sequences could be started in any specified manner. Thus, in the case of a digram source, we can if we wish start a fraction, 0.13, of the sequences with E (this is just the probability of E in English text), a fraction, 0.02, with W (the probability of W), and so on. If we do this, we will find that the fraction of E's is the same, averaging over all the first letters of the ensemble of sequences, as it is averaging over all the second letters of the ensemble, as it is averaging over all the third letters of the ensemble, and so on. No matter what position from the beginning we choose, the fraction of E's or of any other letter occurring in that position, taken over all the sequences in the ensemble, is the same. This independence with respect to position will be true also for the probability with which TH or WE occurs among the first, second, third, and subsequent pairs of letters in the sequences of the ensemble. This is what we mean by stationarity. If we can find a way of assigning probabilities to the various starting conditions used in forming the ensemble of sequences of characters which we allow the source to produce, probabilities such that any statistic obtained by averaging over the ensemble doesn't depend on the distance from the start at which we take an average, then the source is said to be stationary. This may seem difficult or obscure to the reader, but the difficulty arises in giving a useful and exact mathematical form to an idea which would otherwise be mathematically useless.

In the argument above we have, in discussing the infinite ensemble of sequences produced by a source, considered averaging over all first characters, or over all second or third characters (or pairs, or triples of characters, as other examples). Such an average is called an ensemble average. It is different from a sort of average we talked about earlier in this chapter, in which we lumped together all the characters in one sequence and took the average over them. Such an average is called a time average.

The time average and the ensemble average can be different. For instance, consider a source which starts a third of the time with A and produces alternately A and B, a third of the time with B and produces alternately B and A, and a third of the time with E and produces a string of E's. The possible sequences are

1. A B A B A B A B, etc.
2. B A B A B A B A, etc.
3. E E E E E E E E, etc.

We can see that this is a stationary source, yet we have the probabilities shown in Table V.

Table V

Probability    Time Average    Time Average    Time Average    Ensemble
of             Sequence (1)    Sequence (2)    Sequence (3)    Average

A                  1/2             1/2              0             1/3
B                  1/2             1/2              0             1/3
E                   0               0               1             1/3
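The distinction between the two kinds of average can be checked directly. The following is a minimal sketch in Python, assuming the three sequences above are each started a third of the time; it verifies the entries of Table V, the time averages differing from sequence to sequence while every ensemble average is 1/3.

```python
from fractions import Fraction

# The three equally likely sequences the source of Table V can produce (first 8 characters).
sequences = ["ABABABAB", "BABABABA", "EEEEEEEE"]

def time_average(seq, letter):
    """Fraction of the positions in one sequence occupied by the given letter."""
    return Fraction(seq.count(letter), len(seq))

def ensemble_average(seqs, letter, position=0):
    """Fraction of the sequences having the given letter at the chosen position."""
    return Fraction(sum(1 for s in seqs if s[position] == letter), len(seqs))

for letter in "ABE":
    times = [str(time_average(s, letter)) for s in sequences]
    # Time averages differ from sequence to sequence, but each ensemble average is 1/3.
    print(letter, times, "ensemble average:", ensemble_average(sequences, letter))
```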

When a source is stationary, and when every possible ensemble average (of letters, digrams, trigrams, etc.) is equal to the corresponding time average, the source is said to be ergodic. The theorems of information theory which are discussed in subsequent chapters apply to ergodic sources, and their proofs rest on the assumption that the message source is ergodic.¹

While we have here discussed discrete sources which produce sequences of characters, information theory also deals with continuous sources, which generate smoothly varying signals, such as the acoustic waves of speech or the fluctuating electric currents which correspond to these in telephony. The sources of such signals are also assumed to be ergodic.

Why is an ergodic message source an appropriate and profitable mathematical model for study? For one thing, we see by examining the definition of an ergodic source as given above that for an ergodic source the statistics of a message, for instance, the frequency of occurrence of a letter, such as E, or of a digram, such as TH, do not vary along the length of the message. As we analyze a longer and longer stretch of a message, we get a better and better estimate of the probabilities of occurrence of various letters and letter groups. In other words, by examining a longer and longer stretch of a message we are able to arrive at and refine a mathematical description of the source.

Further, the probabilities, the description of the source arrived at through such an examination of one message, apply equally well to all messages generated by the source and not just to the particular message examined. This is assured by the fact that the time and ensemble averages are the same.

Thus, an ergodic source is a particularly simple kind of probabilistic or stochastic source of messages, and simple processes are easier to deal with mathematically than are complicated processes. However, simplicity in itself is not enough. The ergodic source would not be of interest in communication theory if it were not reasonably realistic as well as simple.

Communication theory has two sides. It has a mathematically exact side, which deals rigorously with hypothetical, exactly ergodic sources, sources which we can imagine to produce infinite ensembles of infinite sequences of symbols. Mathematically, we are free to investigate rigorously either such a source itself or the infinite ensemble of messages which it can produce.

We use the theorems of communication theory in connection with the transmission of actual English text. A human being is not a hypothetical, mathematically defined machine. He cannot produce even one infinite sequence of characters, let alone an infinite ensemble of sequences.

A man does, however, produce many long sequences of characters, and all the writers of English together collectively produce a great many such long sequences of characters. In fact, part of this huge output of very long sequences of characters constitutes the messages actually sent by teletypewriter.

¹ Some work has been done on the encoding of nonstationary sources, but it is not discussed in this book.
We will, thus, think of all the different Americans who write out telegrams in English as being, approximately at least, an ergodic source of telegraph messages, and of all Americans speaking over telephones as being, approximately at least, an ergodic source of telephone signals. Clearly, however, all men writing French plus all men writing English could not constitute an ergodic source. The output of each would have certain time-average probabilities for letters, digrams, trigrams, words, and so on, but the probabilities for the English text would be different from the probabilities for the French text, and the ensemble average would resemble neither.

We will not assert that all writers of English (and all speakers of English) constitute a strictly ergodic message source. The statistics of the English we produce change somewhat as we change subject or purpose, and different people write somewhat differently. Too, in producing telephone signals by speaking, some people speak softly, some bellow, and some bellow only when they are angry. What we do assert is that we find a remarkable uniformity in many statistics of messages, as in the case of the probability of E for different samples of English text. Speech and writing as ergodic sources are not quite true to the real world, but they are far truer than is the economic man. They are true enough to be useful.

This difference between the exactly ergodic source of the mathematical theory of communication and the approximately ergodic message sources of the real world should be kept in mind. We must exercise a reasonable caution in applying the conclusions of the mathematical theory of communication to actual problems. We are used to this in other fields. For instance, mathematics tells us that we can deduce the diameter of a circle from the coordinates or locations of any three points on the circle, and this is true for absolutely exact coordinates. Yet no sensible man would try to determine the diameter of a somewhat fuzzy real circle drawn on a sheet of paper by trying to measure very exactly the positions of three points a thousandth of an inch apart on its circumference. Rather, he would draw a line through the center and measure the diameter directly as the distance between diametrically opposite points. This is just the sort of judgment and caution one must always use in applying an exact mathematical theory to an inexact practical case.

Whatever caution we invoke, the fact that we have used a random, probabilistic, stochastic process as a model of man in his role of a message source raises philosophical questions. Does this mean that we imply that man acts at random? There is no such implication. Perhaps if we knew enough about a man, his environment, and his history, we could always predict just what word he would write or speak next.

In communication theory, however, we assume that our only knowledge of the message source is obtained either from the messages that the source produces or perhaps from some less-than-complete study of man himself. On the basis of information so obtained, we can derive certain statistical data which, as we have seen, help to narrow the probability as to what the next word or letter of a message will be. There remains an element of uncertainty. For us who have incomplete knowledge of it, the message source behaves as if certain choices were made at random, insofar as we cannot predict what the choices will be.
If we could predict them, we should incorporate the knowledge which enables us to make the predictions into our statistics of the source. If we had more knowledge, however, we might see that the choices which we cannot predict are not really random, in that they are (on the basis of knowledge that we do not have) predictable.

We can see that the view we have taken of finite-state machines, such as that of Figure III-1, has been limited. Finite-state machines can have inputs as well as outputs. The transition from a particular state to one among several others need not be chosen randomly; it could be determined or influenced by various inputs to the machine. For instance, the operation of an electronic digital computer, which is a finite-state machine, is determined by the program and data fed to it by the programmer.

It is, in fact, natural to think that man may be a finite-state machine, not only in his function as a message source which produces words, but in all his other behavior as well. We can think if we like of all possible conditions and configurations of the cells of the nervous system as constituting states (states of mind, perhaps). We can think of one state passing to another, sometimes with the production of a letter, word, sound, or a part thereof, and sometimes with the production of some other action or of some part of an action. We can think of sight, hearing, touch, and other senses as supplying inputs which determine or influence what state the machine passes into next. If man is a finite-state machine, the number of states must be fantastic and beyond any detailed mathematical treatment. But, so are the configurations of the molecules in a gas, and yet we can explain much of the significant behavior of a gas in terms of pressure and temperature merely.

Can we some day say valid, simple, and important things about the working of the mind in producing written text and other things as well? As we have seen, we can already predict a good deal concerning the statistical nature of what a man will write down on paper, unless he is deliberately trying to behave eccentrically, and, even then, he cannot help conforming to habits of his own.

Such broad considerations are not, of course, the real purpose or meat of this chapter. We set out to find a mathematical model adequate to represent some aspects of the human being in his role as a source of messages and adequate to represent some aspects of the messages he produces. Taking English text as an example, we noted that the frequencies of occurrence of various letters are remarkably constant, unless the writer deliberately avoids certain letters. Likewise, frequencies of occurrence of particular pairs, triplets, and so on, of letters are very nearly constant, as are frequencies of various words.

We also saw that we could generate sequences of letters with frequencies corresponding to those of English text by various random or stochastic processes, such as cutting a lot of text into letters (or words), scrambling the bits of paper in a hat, and drawing them out one at a time. More elaborate stochastic processes, including finite-state machines, can produce an even closer approximation to English text. Thus, we take a generalized stochastic process as a model of a message source, such as a source producing English text.
But, how must we mathematically define or limit the stochastic sources we deal with so that we can prove theorems concerning the encoding of messages generated by the sources? Of course, we must choose a definition consistent with the character of real English text. The sort of stochastic source chosen as a model of actual message sources is the ergodic source. An ergodic source can be regarded as a hypothetical machine which produces an infinite number, or ensemble, of infinite sequences of characters. Roughly, the nature or statistics of the sequences of characters or messages produced by an ergodic source do not change with time; that is, the source is stationary. Further, for an ergodic source the statistics based on one message apply equally well to all messages that the source generates.

The theorems of communication theory are proved exactly for truly ergodic sources. All writers writing English text together constitute an approximately ergodic source of text. The mathematical model, the truly ergodic source, is close enough to the actual situation so that the mathematics we base on it is very useful. But we must be wise and careful in applying the theorems and results of communication theory, which are exact for a mathematical ergodic source, to actual communication problems.

CHAPTER IV

Encoding and Binary Digits

A SOURCE OF INFORMATION may be English text, a man speaking, the sound of an orchestra, photographs, motion picture films, or scenes at which a television camera may be pointed. We have seen that in information theory such sources are regarded as having the properties of ergodic sources of letters, numbers, characters, or electrical signals. A chief aim of information theory is to study how such sequences of characters and such signals can be most effectively encoded for transmission, commonly by electrical means.

Everyone has heard of codes and the encoding of messages. Romantic spies use secret codes. Edgar Allan Poe popularized cryptanalysis in The Gold Bug. The country is full of amateur cryptanalysts who delight in trying to read encoded messages that others have devised.

In this historical sense of cryptography or secret writing, codes are used to conceal the content of an important message from those for whom it is not intended. This may be done by substituting for the words of the message other words which are listed in a code book. Or, in a type of code called a cipher, letters or numbers may be substituted for the letters in the message according to some previously agreed upon secret scheme.

The idea of encoding, of the accurate representation of one thing by another, occurs in other contexts as well. Geneticists believe that the whole plan for a human being is written out in the chromosomes of the germ cell. Some assert that the "text" consists of an orderly linear arrangement of four different units, or "bases," in the DNA (desoxyribonucleic acid) forming the chromosome. This text in turn produces an equivalent text in RNA (ribonucleic acid), and by means of this RNA text proteins made up of sequences of twenty amino acids are synthesized. Some cryptanalytic effort has been spent in an effort to determine how the four-character message of RNA is reencoded into the twenty-character code of the protein. Actually, geneticists have been led to such considerations by the existence of information theory.

The study of the transmission of information has brought about a new general understanding of the problems of encoding, an understanding which is important to any sort of encoding, whether it be the encoding of cryptography or the encoding of genetic information.

We have already noted in Chapter II that English text can be encoded into the symbols of Morse code and represented by short and long pulses of current separated by short and long spaces. This is one simple form of encoding. From the point of view of information theory, the electromagnetic waves which travel from an FM transmitter to the receiver in your home are an encoding of the music which is transmitted. The electric currents in telephone circuits are an encoding of speech. And the sound waves of speech are themselves an encoding of the motions of the vocal tract which produce them.

Nature has specified the encoding of the motions of the vocal tract into the sounds of speech. The communication engineer, however, can choose the form of encoding by means of which he will represent the sounds of speech by electric currents, just as he can choose the code of dots, dashes, and spaces by means of which he represents the letters of English text in telegraphy. He wants to perform this encoding well, not poorly. To do this he must have some standard which distinguishes good encoding from bad encoding, and he must have some insight into means for achieving good encoding. We learned something of these matters in Chapter II.
It is the study of this problem, a study that might in itself seem limited, which has provided through information theory new ideas important to all encoding, whether cryptographic or genetic. These new ideas include a measure of amount of information, called entropy, and a unit of measurement, called the bit. I would like to believe that at this point the reader is clamoring to know the meaning of "amount of information" as measured in bits, and if so I hope that this enthusiasm will carry him over a considerable amount of intervening material about the encoding of messages.

It seems to me that one can't understand and appreciate the solution to a problem unless he has some idea of what the problem is. You can't explain music meaningfully to a man who has never heard any. A story about your neighbor may be full of insight, but it would be wasted on a Hottentot. I think it is only by considering in some detail how a message can be encoded for transmission that we can come to appreciate the need for and the meaning of a measure of amount of information.

It is easiest to gain some understanding of the important problems of coding by considering simple and concrete examples. Of course, in doing this we want to learn something of broad value, and here we may foresee a difficulty.

Some important messages consist of sequences of discrete characters, such as the successive letters of English text or the successive digits of the output of an electronic computer. We have seen, however, that other messages seem inherently different. Speech and music are variations with time of the pressure of air at the ear. This pressure we can accurately represent in telephony by the voltage of a signal traveling along a wire or by some other quantity. Such a variation of a signal with time is illustrated in a of Figure IV-1. Here we assume the signal to be a voltage which varies with time, as shown by the wavy line.

Information theory would be of limited value if it were not applicable to such continuous signals or messages as well as to discrete messages, such as English text. In dealing with continuous signals, information theory first invokes a mathematical theorem called the sampling theorem, which we will use but not prove. This theorem states that a continuous signal can be represented completely by and reconstructed perfectly from a set of measurements or samples of its amplitude which are made at equally spaced times.


Fig. IV-1

The interval between such samples must be equal to or less than one-half of the period of the highest frequency present in the signal. A set of such measurements or samples of the amplitude of the signal a of Figure IV-1 is represented by a sequence of vertical lines of various heights in b of Figure IV-1.

We should particularly note that for such samples of the signal to represent a signal perfectly they must be taken frequently enough. For a voice signal including frequencies from 0 to 4,000 cycles per second we must use 8,000 samples per second. For a signal including frequencies from 0 to 4 million cycles per second we must use 8 million samples per second. In general, if the frequency range of the signal is f cycles per second we must use at least 2f samples per second in order to describe it perfectly.

Thus, the sampling theorem enables us to represent a smoothly varying signal by a sequence of samples which have different amplitudes one from another. This sequence of samples is, however, still inherently different from a sequence of letters or digits. There are only ten digits and there are only twenty-six letters, but a sample can have any of an infinite number of amplitudes. The amplitude of a sample can lie anywhere in a continuous range of values, while a character or a digit has only a limited number of discrete values. The manner in which information theory copes with samples having a continuous range of amplitudes is a topic all in itself, to which we will return later.

Here we will merely note that a signal need not be described or reproduced perfectly. Indeed, with real physical apparatus a signal cannot be reproduced perfectly. In the transmission of speech, for instance, it is sufficient to represent the amplitude of a sample to an accuracy of about 1 per cent. Thus, we can, if we wish, restrict ourselves to the numbers 0 to 99 in describing the amplitudes of successive speech samples and represent the amplitude of a given sample by that one of these hundred integers which is closest to the actual amplitude. By so quantizing the signal samples, we achieve a representation comparable to the discrete case of English text.

We can, then, by sampling and quantizing, convert the problem of coding a continuous signal, such as speech, into the seemingly simpler problem of coding a sequence of discrete characters, such as the letters of English text.

We noted in Chapter II that English text can be sent, letter by letter, by means of the Morse code. In a similar manner, such messages can be sent by teletypewriter. Pressing a particular key on the transmitting machine sends a particular sequence of electrical pulses and spaces out on the circuit. When these pulses and spaces reach the receiving machine, they activate the corresponding type bar, and the machine prints out the character that was transmitted.

Patterns of pulses and spaces indeed form a particularly useful and general way of describing or encoding messages. Although Morse code and teletypewriter codes make use of pulses and spaces of different lengths, it is possible to transmit messages by means of a sequence of pulses and spaces of equal length, transmitted at perfectly regular intervals. Figure IV-2 shows how the electric current sent out on the line varies with time for two different patterns, each six intervals long, of such equal pulses and spaces. Sequence a is pulse-space-space-pulse-space-pulse. Sequence b is pulse-pulse-pulse-space-pulse-pulse.

The presence of a pulse or a space in a given interval specifies one of two different possibilities. We could use any pair of symbols to represent such patterns of pulses or spaces as those of Figure IV-2: yes, no; +, -; 1, 0. Thus we could represent pattern a as follows:

pulse   space   space   pulse   space   pulse
Yes     No      No      Yes     No      Yes
+       -       -       +       -       +
1       0       0       1       0       1

The representation by 1 or 0 is particularly convenient and important. It can be used to relate patterns of pulses to numbers expressed in the binary system of notation. When we write 315 we mean

3 × 10^2 + 1 × 10^1 + 5 × 1 = 3 × 100 + 1 × 10 + 5 × 1 = 315

In this ordinary decimal system of representing numbers we make use of the ten different digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. In the binary system we use only two digits, 0 and 1. When we write 1 0 0 1 0 1 we mean

1 × 2^5 + 0 × 2^4 + 0 × 2^3 + 1 × 2^2 + 0 × 2 + 1 × 1
  = 1 × 32 + 0 × 16 + 0 × 8 + 1 × 4 + 0 × 2 + 1 × 1
  = 37 in decimal notation

It is often convenient to let zeros precede a number; this does not change its value. Thus, in decimal notation we can say

0016 = 16

Fig. IV-2

Or in binary notation

001010 = 1010

In binary numbers, each 0 or 1 is a binary digit. To describe the pulses or spaces occurring in six successive intervals, we can use a sequence of six binary digits. As a pulse or space in one interval is equivalent to a binary digit, we can also refer to a pulse group of six binary digits, or we can refer to the pulse or space occurring in one interval as one binary digit.

Let us consider how many patterns of pulses and spaces there are which are three intervals long. In other words, how many three-digit binary numbers are there? These are all shown in Table VI.

Table VI

000 (0)
001 (1)
010 (2)
011 (3)
100 (4)
101 (5)
110 (6)
111 (7)

The decimal numbers corresponding to these sequences of 1's and 0's regarded as binary numbers are shown in parentheses to the right. We see that there are 8 (0 and 1 through 7) three-digit binary numbers. We may note that 8 is 2^3. We can, in fact, regard an orderly listing of binary digits n intervals long as simply setting down 2^n successive binary numbers, starting with 0. As examples, in Table VII the numbers of different patterns corresponding to different numbers n of binary digits are tabulated.

We see that the number of different patterns increases very rapidly with the number of binary digits. This is because we double the number of possible patterns each time we add one digit. When we add one digit, we get all the old sequences preceded by a 0 plus all the old sequences preceded by a 1.

The binary system of notation is not the only alternative to the decimal system.

Table VII

n (Number of Binary Digits)      Number of Patterns (2^n)
 1                                        2
 2                                        4
 3                                        8
 4                                       16
 5                                       32
10                                    1,024
20                                1,048,576

The octal system is very important to people who use computers. We can regard the octal system as made up of the eight digits 0, 1, 2, 3, 4, 5, 6, 7. When we write 356 in the octal system we mean

3 × 8^2 + 5 × 8 + 6 × 1 = 3 × 64 + 5 × 8 + 6 × 1 = 238 in decimal notation

We can convert back and forth between the octal and the binary systems very simply. We need merely replace each successive block of three binary digits by the appropriate octal digit, as, for instance,

binary   010  111  011  110
octal     2    7    3    6

People who work with binary notation in connection with computers find it easier to remember and transcribe a short sequence of octal digits than a long group of binary digits. They learn to regard patterns of three successive binary digits as an entity, so that they will think of a sequence of twelve binary digits as a succession of four patterns of three, that is, as a sequence of four octal digits.

It is interesting to note, too, that, just as a pattern of pulses and spaces can correspond to a sequence of binary digits, so a sequence of pulses of various amplitudes (0, 1, 2, 3, 4, 5, 6, 7) can correspond to a sequence of octal digits. This is illustrated in Figure IV-3. In a, we have the sequence of off-on, 0-1 pulses corresponding to the binary number 010 111 011 110. The corresponding octal number is 2736, and in b this is represented by a sequence of four pulses of current having amplitudes 2, 7, 3, 6.

Fig. IV-3
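The rule of replacing each block of three binary digits by one octal digit is mechanical enough to write down as a short program. The following is a minimal sketch in Python, with illustrative function names; it reproduces the example above (binary 010 111 011 110, octal 2736) and also evaluates a pattern as an ordinary binary number, as with 100101 and 37.

```python
def binary_to_octal(bits):
    """Replace each successive block of three binary digits by one octal digit."""
    assert len(bits) % 3 == 0, "pad with leading zeros to a multiple of three digits"
    return "".join(str(int(bits[i:i + 3], 2)) for i in range(0, len(bits), 3))

def binary_to_decimal(bits):
    """Evaluate the pattern as a binary number: each digit multiplies a power of two."""
    value = 0
    for digit in bits:
        value = 2 * value + int(digit)
    return value

bits = "010111011110"
print(binary_to_octal(bits))        # 2736
print(binary_to_decimal(bits))      # 1502
print(binary_to_decimal("100101"))  # 37, as worked out above
```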

Conversion from binary to decimal numbers is not so easy. On the average, it takes about 3.32 binary digits to represent one decimal digit. Of course we can assign four binary digits to each decimal digit, as shown in Table VIII, but this means that some patterns are wasted; there are more patterns than we use.

It is convenient to think of sequences of 0's and 1's or sequences of pulses and spaces as binary numbers.

Table VIII

Binary Number      Decimal Digit
0000               0
0001               1
0010               2
0011               3
0100               4
0101               5
0110               6
0111               7
1000               8
1001               9
1010               not used
1011               not used
1100               not used
1101               not used
1110               not used
1111               not used

This helps us to understand how many sequences of a different length there are and how numbers written in the binary system correspond to numbers written in the octal or in the decimal system. In the transmission of information, however, the particular number assigned to a sequence of binary digits is irrelevant. For instance, if we wish merely to transmit representations of octal digits, we could make the assignments shown in Table IX rather than those in Table VI.

Table IX

Sequence of Binary Digits      Octal Digit Represented
000                            5
001                            7
010                            1
011                            6
100                            0
101                            4
110                            2
111                            3

Here the "binary numbers" in the left column designate octal numbers of different numerical value.

In fact, there is another way of looking at such a correspondence between binary digits and other symbols, such as octal digits, a way in which we do not regard the sequence of binary digits as part of a binary number but rather as a means of choosing or designating a particular symbol. We can regard each 0 or 1 as expressing an elementary choice between two possibilities. Consider, for instance, the "tree of choice" shown in Figure IV-4. As we proceed upward from the root to the twigs, let 0 signify that we take the left branch and let 1 signify that we take the right branch. Then 0 1 1 means left, right, right and takes us to the octal digit 6, just as in Table IX.

Just as three binary digits give us enough information to determine one among eight alternatives, four binary digits can determine one among sixteen alternatives, and twenty binary digits can determine one among 1,048,576 alternatives. We can do this by assigning the required binary numbers to the alternatives in any order we wish.

Fig. IV-4
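A sequence of binary digits, read as successive choices, is simply a recipe for walking through such a tree. The following is a minimal sketch in Python, assuming the eight leaves are labeled from left to right as in Table IX; the list and the function name are merely illustrative.

```python
# Leaves of the "tree of choice", ordered left to right as in Table IX.
LEAVES = [5, 7, 1, 6, 0, 4, 2, 3]

def choose(bits, leaves=LEAVES):
    """Follow each binary digit: 0 takes the left half of the leaves, 1 takes the right half."""
    for bit in bits:
        half = len(leaves) // 2
        leaves = leaves[:half] if bit == "0" else leaves[half:]
    return leaves[0]

print(choose("011"))  # left, right, right -> the octal digit 6
print(choose("000"))  # left, left, left   -> the octal digit 5
```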

The alternatives which we wish to specify by successions of binary digits need not of course be numbers at all. In fact, we began by considering how we might encode English text so as to transmit it electrically by sequences of pulses and spaces, which can be represented by sequences of binary digits.

A bare essential in transmitting English text letter by letter is twenty-six letters plus a space, or twenty-seven symbols in all. This of course allows us no punctuation and no Arabic numbers. We can write out the numbers (three, not 3) if we wish and use words for punctuation (stop, comma, colon, etc.). Mathematics says that a choice among 27 symbols corresponds to about 4.75 binary digits. If we are not too concerned with efficiency, we can assign a different 5-digit binary number to each character, which will leave five 5-digit binary numbers unused.

My typewriter has 48 keys, including shift and shift lock. We might add two more "symbols" representing carriage return and line advance, making a total of 50. I could encode my actions in typing, capitalization, punctuation, and all (but not insertion of the paper) by a succession of choices among 50 symbols, each choice corresponding to about 5.62 binary digits. We could use 6 binary digits per character and waste some sequences of binary digits.

This waste arises because there are only thirty-two 5-digit binary numbers, which is too few, while there are sixty-four 6-digit binary numbers, which is too many. How can we avoid this waste? If we have 50 characters, we have 125,000 possible different groups of 3 ordered characters. There are 131,072 different combinations of 17 binary digits. Thus, if we divide our text into blocks of 3 successive characters, we can specify any possible block by a 17-digit binary number and have a few left over. If we had represented each separate character by 6 binary digits, we would have needed 18 binary digits to represent 3 successive characters. Thus, by this block coding, we have cut down the number of binary digits we use in encoding a given length of text by a factor 17/18.

Of course, we might encode English text in quite a different way. We can say a good deal with 16,384 English words. That's quite a large vocabulary. There are just 16,384 fourteen-digit binary numbers. We might assign 16,357 of these to different useful words and 27 to the letters of the alphabet and the space, so that we could spell out any word or sequence of words we failed to include in our word vocabulary. We won't need to put a space between words to which numbers have been assigned; it can be assumed that a space goes with each word. If we have to spell out words very infrequently, we will use about 14 binary digits per word in this sort of encoding.

In ordinary English text there are on the average about 4.5 letters per word. As we must separate words by a space, when we send the message character by character, even if we disregard capitalization and punctuation, we will require on the average 5.5 characters per word. If we encode these using 5 binary digits per character, we will use on the average 27.5 binary digits per word, while in encoding the message word by word we need only 14 binary digits per word. How can this be so? It is because, in spelling out the message letter by letter, we have provided means for sending with equal facility all sequences of English letters, while, in sending word by word, we restrict ourselves to English words.
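The counts of binary digits quoted above all come from the same arithmetic: to name one of m alternatives we need the smallest whole number of binary digits that is not less than the logarithm of m to the base 2. The following is a minimal sketch in Python that carries out this arithmetic for the cases discussed, the 27 letter-or-space symbols, the 50 typewriter symbols, blocks of 3 such symbols, and a vocabulary of 16,384 words.

```python
import math

def digits_needed(alternatives):
    """Smallest whole number of binary digits that can name this many alternatives."""
    return math.ceil(math.log2(alternatives))

print(math.log2(27))           # about 4.75 binary digits per letter-or-space
print(math.log2(50))           # about 5.6 binary digits per typewriter symbol
print(digits_needed(50))       # 6 digits if each character is encoded separately
print(50 ** 3, 2 ** 17)        # 125,000 blocks of 3 characters; 131,072 patterns of 17 digits
print(digits_needed(50 ** 3))  # 17 digits per block of 3, versus 3 x 6 = 18
print(digits_needed(16384))    # 14 digits per word for a 16,384-word vocabulary
```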
Clearly, the average number of binary digits per word required to represent English text depends strongly on how we encode the text.

Now, English text is just one sort of message we might want to transmit. Other messages might be strings of numbers, the human voice, a motion picture, or a photograph. If there are efficient and inefficient ways of encoding English text, we may expect that there will be efficient and inefficient ways of encoding other signals as well.

Indeed, we may be led to believe that there exists in principle some best way of encoding the signals from a given message source, a way which will on the average require fewer binary digits per character or per unit time than any other way. If there is such a best way of encoding a signal, then we might use the average number of binary digits required to encode the signal as a measure of the amount of information per character or the amount of information per second of the message source which produced the signal.

This is just what is done in information theory. How it is done and further reasons for so doing will be considered in the next chapter. Let us first, however, review very briefly what we have covered in this chapter.

In communication theory, we regard coding very broadly, as representing one signal by another. Thus a radio wave can represent the sounds of speech and so form an encoding of these sounds. Encoding is, however, most simply explained and explored in the case of discrete message sources, which produce messages consisting of sequences of characters or numbers. Fortunately, we can represent a continuous signal, such as the current in a telephone line, by a number of samples of its amplitude, using, each second, twice as many samples as the highest frequency present in the signal. Further, we can if we wish represent the amplitude of each of these samples approximately by a whole number.

The representation of letters or numbers by sequences of off-or-on signals, which can in turn be represented directly by sequences of the binary digits 0 and 1, is of particular interest in communication theory. For instance, by using sequences of 4 binary digits we can form 16 binary numbers, and we can use 10 of these to represent the 10 decimal digits. Or, by using sequences of 5 binary digits we can form 32 binary numbers, and we can use 27 of these to represent the letters of the English alphabet plus the space. Thus, we can transmit decimal numbers or English text by sending sequences of off-or-on signals.

We should note that while it may be convenient to regard the sequences of binary digits so used as binary numbers, the numerical value of the binary number has no particular significance; we can choose any binary number to represent a particular decimal digit.

If we use 10 of the 16 possible 4-digit binary numbers to encode the 10 decimal digits, we never use (we waste) 6 binary numbers. We could, but never do, transmit these sequences as sequences of off-or-on signals. We can avoid such waste by means of block coding, in which we encode sequences of 2, 3, or more decimal digits or other characters by means of binary digits. For instance, all sequences of 3 decimal digits can be represented by 10 binary digits, while it takes a total of 12 binary digits to represent separately each of 3 decimal digits.

Any sequence of decimal digits may occur, but only certain sequences of English letters ever occur, that is, the words of the English language.
Thus, it is more efficient to encode English words as sequences of binary digits rather than to encode the letters of the words individually. This again emphasizes the gain to be made by encoding sequences of characters, rather than encoding each character separately.

All of this leads us to the idea that there may be a best way of encoding the messages from a message source, a way which calls for the least number of binary digits.

CHAPTER V

Entropy

IN THE LAST CHAPTER, we have considered various ways in which messages can be encoded for transmission. Indeed, all communication involves some sort of encoding of messages. In the electrical case, letters may be encoded in terms of dots or dashes of electric current or in terms of several different strengths of current and directions of current flow, as in Edison's quadruplex telegraph. Or we can encode a message in the binary language of zeros and ones and transmit it electrically as a sequence of pulses or absences of pulses.

Indeed, we have shown that by periodically sampling a continuous signal such as a speech wave and by representing the amplitudes of each sample approximately by the nearest of a set of discrete values, we can represent or encode even such a continuous wave as a sequence of binary digits.

We have also seen that the number of digits required in encoding a given message depends on how it is encoded. Thus, it takes fewer binary digits per character when we encode a group or block of English letters than when we encode the letters one at a time. More important, because only a few combinations of letters form words, it takes considerably fewer digits to encode English text word by word than it does to encode the same text letter by letter.

Surely, there are still other ways of encoding the messages produced by a particular ergodic source, such as a source of English text. How many binary digits per letter or per word are really needed? Must we try all possible sorts of encoding in order to find out? But, if we did try all forms of encoding we could think of, we would still not be sure we had found the best form of encoding, for the best form might be one which had not occurred to us.

Is there not, in principle at least, some statistical measurement we can make on the messages produced by the source, a measure which will tell us the minimum average number of binary digits per symbol which will serve to encode the messages produced by the source?

In considering this matter, let us return to the model of a message source which we discussed in Chapter III. There we regarded the message source as an ergodic source of symbols, such as letters or words. Such an ergodic source has certain unvarying statistical properties: the relative frequencies of symbols; the probability that one symbol will follow a particular other symbol, or pair of symbols, or triplet of symbols; and so on. In the case of English text, we can speak in the same terms of the relative frequencies of words and of the probability that one word will follow a particular word or a particular pair, triplet, or other combination of words.

In illustrating the statistical properties of sequences of letters or words, we showed how material resembling English text can be produced by a sequence of random choices among letters and words, provided that the letters or words are chosen with due regard for their probabilities or their probabilities of following a preceding sequence of letters or words. In these examples, the throw of a die or the picking of a letter out of a hat can serve to "choose" the next symbol.

In writing or speaking, we exercise a similar choice as to what we shall set down or say next. Sometimes we have no choice; Q must be followed by U. We have more choice as to the next symbol in beginning a word than in the middle of a word. However, in any message source, living or mechanical, choice is continually exercised. Otherwise, the messages produced by the source would be predetermined and completely predictable.
Corresponding to the choice exercised by the message source in producing the message, there is an uncertainty on the part of the recipient of the message. This uncertainty is resolved when the recipient examines the message. It is this resolution of uncertainty which is the aim and outcome of communication.

If the message source involved no choice, if, for instance, it could produce only an endless string of ones or an endless string of zeros, the recipient would not need to receive or examine the message to know what it was; he could predict it in advance. Thus, if we are to measure information in a rational way, we must have a measure that increases with the amount of choice of the source and, thus, with the uncertainty of the recipient as to what message the source may produce and transmit.

Certainly, for any message source there are more long messages than there are short messages. For instance, there are 2 possible messages consisting of 1 binary digit, 4 consisting of 2 binary digits, 16 consisting of 4 binary digits, 256 consisting of 8 binary digits, and so on. Should we perhaps say that amount of information should be measured by the number of such messages? Let us consider the case of four telegraph lines used simultaneously in transmitting binary digits between two points, all operating at the same speed. Using the four lines, we can send 4 times as many digits in a given period of time as we could using one line. It also seems reasonable that we should be able to send 4 times as much information by using four lines. If this is so, we should measure information in terms of the number of binary digits rather than in terms of the number of different messages that the binary digits can form. This would mean that amount of information should be measured, not by the number of possible messages, but by the logarithm of this number.

The measure of amount of information which communication theory provides does this and is reasonable in other ways as well. This measure of amount of information is called entropy. If we want to understand this entropy of communication theory, it is best first to clear our minds of any ideas associated with the entropy of physics. Once we understand entropy as it is used in communication theory thoroughly, there is no harm in trying to relate it to the entropy of physics, but the literature indicates that some workers have never recovered from the confusion engendered by an early admixture of ideas concerning the entropies of physics and communication theory.

The entropy of communication theory is measured in bits. We may say that the entropy of a message source is so many bits per letter, or per word, or per message. If the source produces symbols at a constant rate, we can say that the source has an entropy of so many bits per second.

Entropy increases as the number of messages among which the source may choose increases. It also increases as the freedom of choice (or the uncertainty to the recipient) increases and decreases as the freedom of choice and the uncertainty are restricted. For instance, a restriction that certain messages must be sent either very frequently or very infrequently decreases choice at the source and uncertainty for the recipient, and thus such a restriction must decrease entropy.

It is best to illustrate entropy first in a simple case. The mathematical theory of communication treats the message source as an ergodic process, a process which produces a string of symbols that are to a degree unpredictable.
We must imagine the message source as selecting a given message by some random, i.e., unpredictable, means, which, however, must be ergodic. Perhaps the simplest case we can imagine is that in which there are only two possible symbols, say, X and Y, between which the message source chooses repeatedly, each choice uninfluenced by any previous choices. In this case we can know only that X will be chosen with some probability p_0 and Y with some probability p_1, as in the outcomes of the toss of a biased coin. The recipient can determine these probabilities by examining a long string of characters (X's, Y's) produced by the source. The probabilities p_0 and p_1 must not change with time if the source is to be ergodic.

For this simplest of cases, the entropy H of the message source is defined as

H = -(p_0 log p_0 + p_1 log p_1) bits per symbol

Thus, the entropy is the negative of the sum of the probability p_0 that X will be chosen (or will be received) times the logarithm of p_0 and the probability p_1 that Y will be chosen (or will be received) times the logarithm of this probability.

Whatever plausible arguments one may give for the use of entropy as defined in this and in more complicated cases, the real and true reason is one that will become apparent only as we proceed, and the justification of this formula for entropy will therefore be deferred. It is, however, well to note again that there are different kinds of logarithms and that, in information theory, we use logarithms to the base 2. Some facts about logarithms to the base 2 are noted in Table X.

TABLE X

Fraction p     Another Way of Writing p     Still Another Way of Writing p     Log p
3/4            1/2^.415                     2^-.415                            -.415
1/2            1/2^1                        2^-1                               -1
3/8            1/2^1.415                    2^-1.415                           -1.415
1/4            1/2^2                        2^-2                               -2
1/8            1/2^3                        2^-3                               -3
1/16           1/2^4                        2^-4                               -4
1/64           1/2^6                        2^-6                               -6
1/256          1/2^8                        2^-8                               -8

The logarithm to the base 2 of a number is the power to which 2 must be raised to give the number.

Let us consider, for instance, a "message source" which consists of the tossing of an honest coin. We can let X represent heads and Y represent tails. The probability p1 that the coin will turn up heads is 1/2, and the probability p0 that the coin will turn up tails is also 1/2. Accordingly, from our expression for entropy and from Table X we find that

    H = -(1/2 log 1/2 + 1/2 log 1/2)
    H = -[(1/2)(-1) + (1/2)(-1)]
    H = 1 bit per toss

If the message source is the sequence of heads and tails obtained by tossing a coin, it takes one bit of information to convey whether heads or tails has turned up.

Let us notice, now, that we can represent the outcome of successively tossing a coin by a number of binary digits equal to the number of tosses, letting 1 stand for heads and 0 stand for tails. Hence, in this case at least, the entropy, one bit per toss, and the number of binary digits which can represent the outcome, one binary digit per toss, are equal. In this case at least, the number of binary digits necessary to transmit the message generated by the source (the succession of heads and tails) is equal to the entropy of the source.

Suppose the message source produces a string of 1's and 0's by tossing a coin so weighted that it turns up heads 3/4 of the time and tails only 1/4 of the time. Then

    p1 = 3/4    p0 = 1/4
    H = -(1/4 log 1/4 + 3/4 log 3/4)
    H = -[(1/4)(-2) + (3/4)(-.415)]
    H = .811 bit per toss

We feel that, in the case of a coin which turns up heads more often than tails, we know more about the outcome than if heads or tails were equally likely. Further, if we were constrained to choose heads more often than tails we would have less choice than if we could choose either with equal probability. We feel that this must be so, for if the probability for heads were 1 and for tails 0, we would have no choice at all. And, we see that the entropy for the case above is only .811 bit per toss. We feel somehow that we ought to be able to represent the outcome of a sequence of such biased tosses by fewer than one binary digit per toss, but it is not immediately clear how many binary digits we must use.

If we choose heads over tails with probability p1, the probability of choosing tails must of course be p0 = 1 - p1. Thus, if we know p1 we know p0 as well. We can compute H for various values of p1 and plot a graph of H vs. p1. Such a curve is shown in Figure V-1. H has a maximum value of 1 when p1 is 0.5 and is 0 when p1 is 0 or 1, that is, when it is certain that the message source always produces either one symbol or the other.
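As a quick numerical check, here is a minimal Python sketch (the function and its name are merely illustrative) that evaluates H = -(p0 log p0 + p1 log p1) and reproduces the figures just quoted: 1 bit for the honest coin and about .811 bit for the coin biased 3 to 1.

    import math

    def binary_entropy(p1):
        """Entropy in bits of a two-symbol source with probabilities p1 and 1 - p1."""
        p0 = 1.0 - p1
        h = 0.0
        for p in (p0, p1):
            if p > 0:                  # the term 0 log 0 is taken as 0
                h -= p * math.log2(p)
        return h

    print(binary_entropy(0.5))     # 1.0 bit per toss (honest coin)
    print(binary_entropy(0.75))    # about 0.811 bit per toss (biased coin)
    print(binary_entropy(1.0))     # 0.0 -- no choice, no information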

[Figure V-1. The entropy H, in bits per symbol, plotted against the probability p1 (or p0); the curve rises from 0 at p1 = 0 to a maximum of 1 bit at p1 = 0.5 and falls back to 0 at p1 = 1.]
Really, whether we call heads X and tails Y or heads Y and tails X is immaterial, so the curve of H vs. p1 must be the same as H vs. p0. Thus, the curve of Figure V-1 is symmetrical about the dashed centerline at p1 and p0 equal to 0.5.

A message source may produce successive choices among the ten decimal digits, or among the twenty-six letters of the alphabet, or among the many thousands of words of the English language. Let us consider the case in which the message source produces one among n symbols or words, with probabilities which are independent of previous choices. In this case the entropy H is defined as

    H = -Σ_{i=1}^{n} p_i log p_i bits per symbol    (5.1)

Here the sign Σ (sigma) means to sum or to add up various terms. p_i is the probability of the i th symbol being chosen. The i = 1 below and the n above the Σ mean to let i be 1, 2, 3, etc., up to n, so the equation says that the entropy will be given by adding p1 log p1 and p2 log p2 and so on, including all symbols. We see that when n = 2 we have the simple case which we considered earlier.

Let us take an example. Suppose, for instance, that we toss two coins simultaneously. Then there are four possible outcomes, which we can label with the numbers 1 through 4:

    H H or 1
    H T or 2
    T H or 3
    T T or 4

If the coins are honest, the probability of each outcome is 1/4 and the entropy is

    H = -(1/4 log 1/4 + 1/4 log 1/4 + 1/4 log 1/4 + 1/4 log 1/4)
    H = -(-1/2 - 1/2 - 1/2 - 1/2)
    H = 2 bits per pair tossed

It takes 2 bits of information to describe or convey the outcome of tossing a pair of honest coins simultaneously. As in the case of tossing one coin which has equal probabilities of landing heads or tails, we can in this case see that we can use 2 binary digits to describe the outcome of a toss: we can use 1 binary digit for each coin. Thus, in this case too, we can transmit the message generated by the process (of tossing two coins) by using a number of binary digits equal to the entropy.

If we have some number n of symbols all of which are equally probable, the probability of any particular one turning up is 1/n, so we have n terms, each of which is 1/n log 1/n. Thus, the entropy is in this case

    H = -log 1/n bits per symbol

For instance, an honest die when rolled has equal probabilities of turning up any number from 1 to 6. Hence, the entropy of the sequence of numbers so produced must be log 6, or 2.58 bits per throw.
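Equation 5.1 is just as easy to check numerically. The minimal sketch below (the helper function is purely illustrative) reproduces the 2 bits for the pair of honest coins and the 2.58 bits for the die.

    import math

    def entropy(probabilities):
        """Entropy in bits per symbol, equation 5.1: H = -sum of p_i log2 p_i."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.25, 0.25, 0.25, 0.25]))    # 2 bits per pair of tossed coins
    print(entropy([1/6] * 6))                   # about 2.58 bits per throw of a die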

More generally, suppose that we choose each time with equal likelihood among all binary numbers with N digits. There are 2^N such numbers, so

    n = 2^N

From Table X we easily see that

    log 1/n = log 2^-N = -N

Thus, for a source which produces at each choice with equal likelihood some N-digit binary number, the entropy is N bits per number. Here the message produced by the source is a binary number which can certainly be represented by N binary digits. And, again, the message can be represented by a number of binary digits equal to the entropy of the message, measured in bits. This example illustrates graphically how the logarithm must be the correct mathematical function in the entropy.

Ordinarily the probability that the message source will produce a particular symbol is different for different symbols. Let us take as an example a message source which produces English words independently of what has gone before but with the probabilities characteristic of English prose. This corresponds to the first-order word approximation given in Chapter III.

In the case of English prose, we find as an empirical fact that if we order the words according to frequency of usage, so that the most frequently used, the most probable word (the, in fact) is word number 1, the next most probable word (of) is number 2, and so on, then the probability for the rth word is very nearly (if r is not too large)

    p_r = .1/r    (5.2)

If equation 5.2 were strictly true, the points in Figure V-2, in which word probability or frequency p_r is plotted against word order or rank r, would fall on the solid line which extends from upper left to lower right. We see that this is very nearly so. This empirical inverse relation between word probability and word rank is known as Zipf's law. We will discuss Zipf's law in Chapter XII; here, we propose merely to use it.

[Figure V-2. Word frequency plotted against word order (rank) for English words, both on logarithmic scales; the points, including those for words such as say and quality, fall close to the straight line representing equation 5.2.]

We can show that this equation (5.2) cannot hold for all words. To see this, let us consider tossing a coin. If the probability of heads turning up is 1/2 and the probability of tails turning up is 1/2, then there is no other possible outcome: 1/2 + 1/2 = 1. If there were an additional probability of 1/10 that the coin would stand on edge, we would have to conclude that in a hundred tosses we would expect 110 outcomes: heads 50 times, tails 50 times, and standing on edge 10 times. This is patently absurd. The probabilities of all outcomes must add up to unity. Now, let us note that if we add up successively p1 plus p2, etc., as given by equation 5.2, we find that by the time we come to p_8,727 the sum of the successive probabilities has become unity. If we took this literally, we would conclude that no additional word could ever occur. Equation 5.2 must be a little in error.

Nonetheless, the error is not great, and Shannon used equation 5.2 in computing the entropy of a message source which produces words independently but with the probability of their occurring in English text. In order to make the sum of the probabilities of all words unity, he included only the 8,727 most frequently used words. He found the entropy to be 9.14 bits per word.

In Chapter IV, we saw that English text can be encoded letter by letter by using 5 binary digits per character, or 27.5 binary digits per word. We also saw that by providing different sequences of binary digits for each of 16,384 words and 27 characters, we could encode English text by using about 14 binary digits per word. We are now beginning to suspect that the number of binary digits actually required is given by the entropy, and, as we have seen, Shannon's estimate, based on the relative probabilities of English words, would be 9.14 binary digits per word.
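Shannon's computation is easy to repeat approximately in a few lines. The sketch below simply assigns the Zipf probability .1/r to each of the 8,727 most frequent words, as described above, and evaluates equation 5.1; the precise figure depends on rounding, but it comes out close to the value quoted.

    import math

    # Word probabilities taken as .1/r (equation 5.2) for the 8,727 most
    # frequently used words, as in the text; these probabilities add up to nearly 1.
    probs = [0.1 / r for r in range(1, 8728)]

    entropy = -sum(p * math.log2(p) for p in probs)
    print(entropy)    # roughly 9.1 bits per word, close to the figure quoted above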
As a next step in exploring this matter of the number of binary digits required to encode the message produced by a message source, we will consider a startling theorem which Shannon proved concerning the "messages" produced by an ergodic source which selects a sequence of letters or words independently with certain probabilities.

Let us consider all of the messages the source can produce which consist of some particular large number of characters. For example, we might consider all messages which are 100,000 symbols (letters, words, characters) long. More generally, let us consider messages having a number M of characters. Some of these messages are more probable than others. In the probable messages, symbol 1 occurs about Mp1 times, symbol 2 occurs about Mp2 times, etc. Thus, in these probable messages each symbol occurs with about the frequency characteristic of the source. The source might produce other sorts of messages, for instance, a message consisting of one symbol endlessly repeated or merely a message in which the numbers of the various symbols differed markedly from M times their probabilities, but it seldom does.

The remarkable fact is that, if H is the entropy of the source per symbol, there are just about 2^MH probable messages, and the rest of the messages all have vanishingly small probabilities of ever occurring. In other words, if we ranked the messages from most probable to least probable and assigned binary numbers of MH digits to the 2^MH most probable messages, we would be almost certain to have a number corresponding to any M-symbol message that the source actually produced.

Let us illustrate this in particular simple cases. Suppose that the symbols produced are 1 or 0. If these are produced with equal probabilities, a probability 1/2 for 1 and a probability 1/2 for 0, the entropy H is, as we have seen, 1 bit per symbol. Let us let the source produce messages M = 1,000 digits long. Then MH = 1,000, and, according to Shannon's theorem, there must be 2^1000 different probable messages. Now, by using 1,000 binary digits we can write just 2^1000 different binary numbers. Thus, in order to assign a different binary number to each probable message, we must use binary numbers 1,000 digits long. This is just what we would expect. In order to designate to the message destination which 1,000-digit binary number the message source produces, we must send a message 1,000 binary digits long.

But, suppose that the digits constituting the messages produced by the message source are obtained by tossing a coin which turns up heads, designating 1, 3/4 of the time and tails, designating 0, 1/4 of the time. The typical messages so produced will contain more 1's than 0's, but that is not all. We have seen that in this case the entropy H is only .811 bit per toss. If M, the length of the message, is again taken as 1,000 binary digits, MH is only 811. Thus, while as before there are 2^1000 possible messages, there are only 2^811 probable messages.

Now, by using 811 binary digits we can write 2^811 different binary numbers, and we can assign one of these to each of the 1,000-digit probable messages, leaving the other improbable 1,000-digit messages unnumbered. Thus, we can send word to a message destination which probable 1,000-digit message our message source produces by sending only 811 binary digits. And the chance that the message source will produce an improbable 1,000-digit message, to which we have assigned no number, is negligible. Of course, the scheme is not quite foolproof. The message source may still very occasionally turn up a message for which we have no label among all 2^811 of our 811-digit binary labels. In this case we cannot transmit the message, at least not by using 811 binary digits.
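This concentration of probability can be seen numerically. In the simulation sketched below, under the same assumptions as the example above (the probability of a 1 is 3/4, messages are 1,000 digits long), the quantity -(1/M) log2 (probability of the message actually produced) comes out close to .811 for almost every message, which is another way of saying that the probable messages number about 2^MH and that each has probability of roughly 2^-MH.

    import math
    import random

    M = 1000          # length of each message in binary digits
    P_ONE = 0.75      # probability that any digit is a 1

    def surprisal_per_digit():
        """Generate one message and return -(1/M) log2 of its probability."""
        total = 0.0
        for _ in range(M):
            p = P_ONE if random.random() < P_ONE else 1.0 - P_ONE
            total -= math.log2(p)
        return total / M

    # Almost every generated message gives a value near 0.811 bit per digit.
    for _ in range(5):
        print(round(surprisal_per_digit(), 3))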
We see that again we have a strong indication that the number of binary digits required to transmit a message is just the entropy in bits per symbol times the number of symbols. And, we might note that in this last illustration we achieved such an economical transmission by block encoding, that is, by lumping 1,000 (or some other large number) message digits together and representing each probable combination of digits by its individual code (of 811 binary digits).

How firmly and generally can this supposition be established? So far we have considered only cases in which the message source produces each symbol (number, letter, word) independently of the symbols it has produced before. We know this is not true for English text. Besides the constraints of word frequency, there are constraints of word order, so that the writer has less choice as to what the next word will be than he would if he could choose it independently of what has gone before.

How are we to handle this situation? We have a clue in the block coding which we discussed in Chapter IV and which has been brought to mind again in the last example. In an ergodic process the probability of the next letter may depend only on the preceding 1, 2, 3, 4, 5, or more letters but not on earlier letters. The second- and third-order approximations to English given in Chapter III illustrate text produced by such a process. Indeed, in any reasonable sense the ergodic process of which we are to make a mathematical model must be such that the effect of the past on what symbol will be produced next decreases as the remoteness of that past is greater. This is reasonably valid in the case of real English as well. While we can imagine examples to the contrary (the consistent use of the same name for a character in a novel), in general the word I write next does not depend on just what word I wrote 10,000 words back.

Now, suppose that before we encode a message we divide it up into very long blocks of symbols. If the blocks are long enough, only the symbols near the beginning of a block will depend on symbols in the previous block, and, if we make the block long enough, these symbols that do depend on symbols in the previous block will form a negligible part of all the symbols in the block. This makes it possible for us to compute the entropy per block of symbols by means of equation 5.1. To keep matters straight, let us call the probability of a particular one of the multitudinous long blocks of symbols, which we will call the i th block, P(B_i). Then the entropy per block will be

    H = -Σ_i P(B_i) log P(B_i)

Any mathematician would object to calling this the entropy. He would say, the quantity H given by the above equation approaches the entropy as we make the block longer and longer, so that it includes more and more symbols. Thus, we must assume that we make the blocks very long indeed and get a very close approximation to the entropy. With this proviso, we can obtain the entropy per symbol by dividing the entropy per block by the number N of symbols per block:

    H = -(1/N) Σ_i P(B_i) log P(B_i)    (5.3)

In general, an estimate of entropy is always high if it fails to take into account some relations between symbols. Thus, as we make N, the number of symbols per block, greater and greater, H as given by 5.3 will decrease and approach the true entropy.

We have insisted from the start that amount of information must be so defined that if separate messages are sent over several telegraph wires, the total amount of information must be the sum of the amounts of information sent over the separate wires. Thus, to get the entropy of several message sources operating simultaneously, we add the entropies of the separate sources. We can go further and say that if a source operates intermittently we must multiply its information rate or entropy by the fraction of the time that it operates in order to get its average information rate.

Now, let us say that we have one message source when we have just sent a particular sequence of letters such as TH. In this case the probability that the next letter will be E is very high. We have another particular message source when we have just sent NQ. In this case the probability that the next symbol will be U is unity. We calculate the entropy for each of these message sources. We multiply the entropy of a source which we label B_i by the probability p(B_i) that this source will occur (that is, by the fraction of instances in which this source is in operation). We multiply the entropy of each other source by the probability that that source will occur, and so on. Then we add all the numbers we get in this way in order to get the average entropy or rate of the over-all source, which is a combination of the many different sources, each of which operates only part time. As an example, consider a source involving digram probabilities only, so that the whole effect of the past is summed up in the letter last produced. One source will be the source we have when this letter is E; this will occur in .13 of the total instances. Another source will be the source we have when the letter just produced is W; this will occur in .02 of the total instances.
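This weighting of conditional entropies is easy to carry out on a sample of text. A minimal sketch follows; the short repeated sentence used as a sample is only a stand-in for a long run of English, and the function names are merely illustrative.

    import math
    from collections import Counter, defaultdict

    def conditional_entropy(text):
        """Average, over preceding letters, of the entropy of the next letter
        (the digram case of the weighted sum described above)."""
        pairs = list(zip(text, text[1:]))
        following = defaultdict(Counter)
        for a, b in pairs:
            following[a][b] += 1

        h = 0.0
        for a, counts in following.items():
            total = sum(counts.values())
            weight = total / len(pairs)      # fraction of instances this "source" operates
            h_a = -sum((c / total) * math.log2(c / total) for c in counts.values())
            h += weight * h_a
        return h

    sample = "the quick brown fox jumps over the lazy dog " * 50
    print(conditional_entropy(sample))   # lower than the single-letter entropy of the same text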
Putting this in formal mathematical terms, we say that if a particular block of N symbols, which we designate by B_i, has just occurred, the probability that the next symbol will be symbol s_j is

    p_{B_i}(s_j)

The entropy of this "source," which operates only when a particular block of N symbols designated by B_i has just been produced, is

    -Σ_j p_{B_i}(s_j) log p_{B_i}(s_j)

But, in what fraction of instances does this particular message source operate? The fraction of instances in which this source operates is the fraction of instances in which we encounter block B_i rather than some other block of symbols; we call this fraction p(B_i). Thus, taking into account all blocks of N symbols, we write the sum of the entropies of all the separate sources (each separate source defined by what particular block B_i of N symbols has preceded the choice of the symbol s_j) as

    H_N = -Σ_{i,j} p(B_i) p_{B_i}(s_j) log p_{B_i}(s_j)    (5.4)

The i, j under the summation sign mean to let i and j assume all possible values and to add all the numbers we get in this way.

As we let the number N of symbols preceding symbol s_j become very large, H_N approaches the entropy of the source. If there are no statistical influences extending over more than N symbols (this will be true for a digram source for N = 1 and for a trigram source for N = 2), then H_N is the entropy.

Shannon writes equation 5.4 a little differently. The probability p(B_i, s_j) of encountering the block B_i followed by the symbol s_j is the probability p(B_i) of encountering the block B_i times the probability p_{B_i}(s_j) that symbol s_j will follow block B_i. Hence, we can write 5.4 as follows:

    H_N = -Σ_{i,j} p(B_i, s_j) log p_{B_i}(s_j)

In Chapter III we considered a finite-state machine, such as that shown in Figure III-3, as a source of text. We can, if we wish, base our computation of entropy on such a machine. In this case, we regard each state of the machine as a message source and compute the entropy for that state. Then we multiply the entropy for that state by the probability that the machine will be in that state and sum (add up) over all states in order to get the entropy.

Putting the matter symbolically, suppose that when the machine is in a particular state i it has a probability p_i(j) of producing a particular symbol which we designate by j. For instance, in a state labeled i = 10 it might have a probability of 0.03 of producing the third letter of the alphabet, which we label j = 3. Then

    p_10(3) = .03

The entropy H_i of state i is computed in accord with 5.1:

    H_i = -Σ_j p_i(j) log p_i(j)

Now, we say that the machine has a probability P_i of being in the i th state. The entropy per symbol for the machine as a source of symbols is then

    H = Σ_i P_i H_i

We can write this as

    H = -Σ_{i,j} P_i p_i(j) log p_i(j) bits per symbol    (5.5)

P_i is the probability that the finite-state machine is in the i th state, and p_i(j) is the probability that it produces the j th symbol when it is in the i th state. The i and j under the Σ mean to allow both i and j to assume all possible values and to add all the numbers so obtained.

Thus, we have gone easily and reasonably from the entropy of a source which produces symbols independently, and to which equation 5.1 applies, to the more difficult case in which the probability of a symbol occurring depends on what has gone before. And, we have three alternative methods for computing or defining the entropy of the message source. These three methods are equivalent and rigorously correct for true ergodic sources. We should remember, of course, that the source of English text is only approximately ergodic.

Once having defined entropy per symbol in a perfectly general way, the problem is to relate it unequivocally to the average number of binary digits per symbol necessary to encode a message. We have seen that if we divide the message into blocks of letters or words and treat each possible block as a symbol, we can compute the entropy per block by the same formula we used per independent symbol and get as close as we like to the source entropy merely by making the blocks very long.

Thus, the problem is to find out how to encode efficiently in binary digits a sequence of symbols chosen from a very large group of symbols, each of which has a certain probability of being chosen. Shannon and Fano both showed ways of doing this, and Huffman found an even better way, which we shall consider here.

Let us for convenience list all the symbols vertically in order of decreasing probability. Suppose the symbols are the eight words the, man, to, runs, house, likes, horse, sells, which occur independently with probabilities of their being chosen, or appearing, as listed in Table XI.

We can compute the entropy per word by means of 5.1; it is 2.21 bits per word. However, if we merely assigned one of the eight 3-digit binary numbers to each word, we would need 3 digits to transmit each word. How can we encode the words more efficiently?

Figure V-3 shows how to construct the most efficient code for encoding such a message word by word.

TABLE XI

Word      Probability
the       .50
man       .15
to        .12
runs      .10
house     .04
likes     .04
horse     .03
sells     .02

The words are listed to the left, and the probabilities are shown in parentheses. In constructing the code, we first find the two lowest probabilities, .02 (sells) and .03 (horse), and draw lines to the point marked .05, the probability of either horse or sells. We then disregard the individual probabilities connected by the lines and look for the two lowest probabilities, which are .04 (likes) and .04 (house). We draw lines to the right to a point marked .08, which is the sum of .04 and .04. The two lowest remaining probabilities are now .05 and .08, so we draw a line to the right connecting them, to give a point marked .13.

[Figure V-3. Construction of the Huffman code for the eight words the (.50), man (.15), to (.12), runs (.10), house (.04), likes (.04), horse (.03), and sells (.02); lines drawn to the right join the two lowest remaining probabilities at each step until all paths meet at the point marked 1.00.]
We proceed thus until paths run from each word to a common point to the right, the point marked 1.00. We then label each upper path going to the left from a point 1 and each lower path 0. The code for a given word is then the sequence of digits encountered going left from the common point 1.00 to the word in question. The codes are listed in Table XII.

TABLE XII

Word      Probability p     Code      Number of Digits in Code, N     Np
the       .50               1         1                               .50
man       .15               001       3                               .45
to        .12               011       3                               .36
runs      .10               010       3                               .30
house     .04               00011     5                               .20
likes     .04               00010     5                               .20
horse     .03               00001     5                               .15
sells     .02               00000     5                               .10

In Table XII we have shown not only each word and its code but also the probability of each code and the number of digits in each code. The probability of a word times the number of digits in its code gives the average number of digits per word in a long message due to the use of that particular word. If we add the products of the probabilities and the numbers of digits for all the words, we get the average number of digits per word, which is 2.26. This is a little larger than the entropy per word, which we found to be 2.21 bits per word, but it is a smaller number of digits than the 3 digits per word we would have used if we had merely assigned a different 3-digit code to each word.

Not only can it be proved that this Huffman code is the most efficient code for encoding a set of symbols having different probabilities, it can be proved that it always calls for less than one binary digit per symbol more than the entropy (in the above example, it calls for only 0.05 extra binary digit per symbol).

Now suppose that we combine our symbols into blocks of 1, 2, 3, or more symbols before encoding. Each of these blocks will have a probability (in the case of symbols chosen independently, the probability of a sequence of symbols will be the product of the probabilities of the symbols). We can find a Huffman code for these blocks of symbols. As we make the blocks longer and longer, the number of binary digits in the code for each block will increase. Yet, our Huffman code will take less than one extra digit per block above the entropy in bits per block! Thus, as the blocks and their codes become very long, the less-than-one extra digit of the Huffman code will become a negligible fraction of the total number of digits, and, as closely as we like (by making the blocks longer), the number of binary digits per block will equal the entropy in bits per block.

Suppose we have a communication channel which can transmit a number C of off-or-on pulses per second. Such a channel can transmit C binary digits per second. Each binary digit is capable of transmitting one bit of information. Hence we can say that the information capacity of this communication channel is C bits per second. If the entropy H of a message source, measured in bits per second, is less than C, then, by encoding with a Huffman code, the signals from the source can be transmitted over the channel.

Not all channels transmit binary digits. A channel, for instance, might allow three amplitudes of pulses, or it might transmit different pulses of different lengths, as in Morse code. We can imagine connecting various different message sources to such a channel. Each source will have some entropy or information rate. Some source will give the highest entropy that can be transmitted over the channel, and this highest possible entropy is called the channel capacity C of the channel and is measured in bits per second.

By means of the Huffman code, the output of the channel when it is transmitting a message of this greatest possible entropy can be encoded into some least number of binary digits per second, and, when long stretches of message are encoded into long stretches of binary digits, it must take very close to C binary digits per second to represent the signals passing over the channel. This encoding can, of course, be used in the reverse sense, and C independent binary digits per second can be so encoded as to be transmitted over the channel. Thus, a source of entropy H can be encoded into H binary digits per second, and a general discrete channel of capacity C can be used to transmit C bits per second.
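The procedure of Figure V-3 is short enough to set down as a program. The following minimal sketch repeatedly merges the two least probable entries and then reads the code groups back; it reproduces the code lengths of Table XII and the average of 2.26 digits per word, though the particular digits assigned to each word may differ from those in the table.

    import heapq

    def huffman_code(probabilities):
        """Build a Huffman code for a dict of symbol -> probability."""
        # Each heap entry: (probability, tie-breaker, {symbol: partial code}).
        heap = [(p, i, {word: ""}) for i, (word, p) in enumerate(probabilities.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            p1, _, codes1 = heapq.heappop(heap)   # lowest remaining probability
            p2, _, codes2 = heapq.heappop(heap)   # next lowest
            # Prefix the two groups with different digits and merge them.
            merged = {w: "0" + c for w, c in codes1.items()}
            merged.update({w: "1" + c for w, c in codes2.items()})
            heapq.heappush(heap, (p1 + p2, count, merged))
            count += 1
        return heap[0][2]

    words = {"the": .50, "man": .15, "to": .12, "runs": .10,
             "house": .04, "likes": .04, "horse": .03, "sells": .02}
    code = huffman_code(words)
    print(code)
    print(sum(p * len(code[w]) for w, p in words.items()))   # 2.26 digits per word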
We are now in a position to appreciate one of the fundamental theorems of information theory. Shannon calls this the fundamental theorem of the noiseless channel. He states it as follows:

    Let a source have entropy H (bits per symbol) and a channel have a capacity [to transmit] C bits per second. Then it is possible to encode the output of the source in such a way as to transmit at the average rate (C/H) - e symbols per second over the channel, where e is arbitrarily small. It is not possible to transmit at an average rate greater than C/H.

Let us restate this without mathematical niceties. Any discrete channel that we may specify, whether it transmits binary digits, letters and numbers, or dots, dashes, and spaces of certain distinct lengths, has some particular unique channel capacity C. Any ergodic message source has some particular entropy H. If H is less than or equal to C we can transmit the messages generated by the source over the channel. If H is greater than C we had better not try to do so, because we just plain can't.

We have indicated above how the first part of this theorem can be proved. We have not shown that a source of entropy H cannot be encoded in less than H binary digits per symbol, but this also can be proved.

We have now firmly arrived at the fact that the entropy of a message source measured in bits tells us how many binary digits (or off-or-on pulses, or yeses-or-noes) are required, per character, or per letter, or per word, or per second, in order to transmit messages produced by the source. This identification goes right back to Shannon's original paper. In fact, the word bit is merely a contraction of binary digit and is generally used in place of binary digit. Here I have used bit in a particular sense, as a measure of amount of information, and in other contexts I have used a different expression, binary digit. I have done this in order to avoid a confusion which might really have arisen had I started out by using bit to mean two different things.

After all, in practical situations the entropy in bits is usually different from the number of binary digits involved. Suppose, for instance, that a message source randomly produces the symbol 1 with a probability 1/4 and the symbol 0 with the probability 3/4 and that it produces 10 symbols per second. Certainly such a source produces binary digits at a rate of 10 per second, but the information rate or entropy of the source is .811 bit per binary digit and 8.11 bits per second. We could encode the sequence of binary digits produced by this source by using on the average only 8.11 binary digits per second.

Similarly, suppose we have a communication channel which is capable of transmitting 10,000 arbitrarily chosen off-or-on pulses per second. Certainly, such a channel has a channel capacity of 10,000 bits per second. However, if the channel is used to transmit a completely repetitive pattern of pulses, we must say that the actual rate of transmission of information is 0 bits per second, despite the fact that the channel is certainly transmitting 10,000 binary digits per second.

Here we have used bit only in the sense of a binary measure of amount of information, as a measure of the entropy or information rate of a message source in bits per symbol or in bits per second, or as a measure of the information transmission capabilities of a channel in bits per symbol or bits per second. We can describe a bit as an elementary binary choice or decision between two possibilities which have equal probabilities. At the message source a bit represents a certain amount of choice as to the message which will be generated; in writing grammatical English we have on the average a choice of about one bit per letter. At the destination a bit of information resolves a certain amount of uncertainty; in receiving English text there is, on the average, about one bit of uncertainty as to what the next letter will be.
When we are transmitting messages generated by an information source by means of off-or-on pulses, we know how many binary digits we are transmitting per second even when (as in most cases) we don't know the entropy of the source. (If we know the entropy of the source in bits per second to be less than the binary digits used per second, we would know that we could get along in principle with fewer binary digits per second.) We know how to use the binary digits to specify or determine one out of several possibilities, either by means of a tree such as that of Figure IV-4 or by means of a Huffman code such as that of Figure V-3. It is common in such a case to speak of the rate of transmission of binary digits as a bit rate, but there is a certain danger that the inexperienced may muddy their thinking if they do this.

All that I really ask of the reader is to remember that we have used bit in one sense only, as a measure of information, and have called 0 or 1 a binary digit. If we can transmit 1,000 freely chosen binary digits per second, we can transmit 1,000 bits of information a second. It may be convenient to use bit to mean binary digit, but when we do so we should be sure that we understand what we are doing.

Let us now return for a moment to an entirely different matter, the Huffman code given in Table XII and Figure V-3. When we encode a message by using this code and get an uninterrupted string of symbols, how do we tell whether we should take a particular 1 in the string of symbols as indicating the word the or as part of the code for some other word?

We should note that of the codes in Table XII, none forms the first part of another. This is called the prefix property. It has important and, indeed, astonishing consequences, which are easily illustrated. Suppose, for instance, that we encode the message: the man sells the house to the man the horse runs to the man. The encoded message is as follows, with the message words written above their code groups:

    the   man   sells   the   house   to    the   man   the   horse   runs   to    the   man
    1     001   00000   1     00011   011   1     001   1     00001   010    011   1     001

Now suppose we receive only the digits following the first six, 100100; that is, we miss the beginning of the message. We start to decode by looking for the shortest sequence of digits which constitutes a word in our code. This is 00010, which corresponds to likes. We go on in this fashion, and the "decoded" words come out as follows:

    00010   001   1     011   1     001   1     00001   010    011   1     001
    likes   man   the   to    the   man   the   horse   runs   to    the   man

We see that after a few errors the word divisions we find correspond to the true ones, and from that point on the deciphered message is correct. We don't need to know where the message starts in order to decode it as correctly as possible (unless all code words are of equal length).
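The self-correcting behavior just described is easy to reproduce. In the minimal sketch below, decoding simply takes the shortest group of digits that forms a code word, which is all the prefix property requires; dropping the first six digits produces a few wrong words and then the correct message, as above.

    CODE = {"the": "1", "man": "001", "to": "011", "runs": "010",
            "house": "00011", "likes": "00010", "horse": "00001", "sells": "00000"}
    DECODE = {digits: word for word, digits in CODE.items()}

    def decode(digits):
        """Split a digit string into code groups; the prefix property makes this unambiguous."""
        words, group = [], ""
        for d in digits:
            group += d
            if group in DECODE:          # shortest matching code group
                words.append(DECODE[group])
                group = ""
        return words

    message = "the man sells the house to the man the horse runs to the man".split()
    encoded = "".join(CODE[w] for w in message)
    print(decode(encoded))        # recovers the original message
    print(decode(encoded[6:]))    # a few wrong words at first, then falls into step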

When we look back we can see that we have fulfilled the purpose of this chapter. We have arrived at a measure of the amount of information per symbol or per unit time of an ergodic source, and we have shown how this is equal to the average number of binary digits per symbol necessary to transmit the messages produced by the source. We have noted that to attain transmission with negligibly more bits than the entropy, we must encode the messages produced by the source in long blocks, not symbol by symbol.

We might ask, however, how long do the blocks have to be? Here we come back to another consideration. There are two reasons for encoding in long blocks. One is, in order to make the average number of binary digits per symbol used in the Huffman code negligibly larger than the entropy per symbol. The other is, that to encode such material as English text efficiently we must take into account the influence of preceding symbols on the probability that a given symbol will appear next. We have seen that we can do this using equation 5.3 and taking very long blocks. We return, then, to the question: how many symbols N must the block of characters have so that (1) the Huffman code is very efficient, (2) the entropy per block, disregarding interrelations outside of the block, is very close to N times the entropy per symbol? In the case of English text, condition 2 is governing.

Shannon has estimated the entropy per letter for English text by measuring a person's ability to guess the next letter of a message after seeing 1, 2, 3, etc., preceding letters. In these tests the "alphabet" used consisted of 26 letters plus the space.

Figure V-4 shows the upper and lower bounds on the entropy of English plotted vs. the number of letters the person saw in making his prediction.

[Figure V-4. Upper and lower bounds on the entropy of English text, in bits per letter, plotted against the number of preceding letters seen by the person making the prediction.]
While the curve seems to drop slowly as the number of letters is increased from 10 to 15, it drops substantially between 15 and 100. This would appear to indicate that we might have to encode in blocks as large as 100 letters long in order to encode English really efficiently.

From Figure V-4 it appears that the entropy of English text lies somewhere between 0.6 and 1.3 bits per letter. Let us assume a value of 1 bit per letter. Then it will take on the average 100 binary digits to encode a block of 100 letters. This means that there are 2^100 probable English sequences of 100 letters. In our usual decimal notation, 2^100 can be written as 1 followed by 30 zeroes, a fantastically large number.

In endeavoring to find the probability in English text of all meaningful blocks of letters 100 letters long, we would have to count the relative frequency of occurrence of each such block. Since there are 10^30 highly likely blocks, this would be physically impossible. Further, this is impossible in principle. Most of these 10^30 sequences of letters and spaces (which do not include all meaningful sequences) have never been written down! Thus, it is impossible to speak of the relative frequencies or probabilities of such long blocks of letters as derived from English text.

Here we are really confronted with two questions: the accuracy of the description of English text as the product of an ergodic source and the most appropriate statistical description of that source. One may believe that appropriate probabilities do exist in some form in the human being even if they cannot be evaluated by the examination of existing text. Or one may believe that the probabilities exist and that they can be derived from data taken in some way more appropriate than a naive computation of the probabilities of sequences of letters. We may note, for instance, that equations 5.4 and 5.5 also give the entropy of an ergodic source. Equation 5.5 applies to a finite-state machine. We have noted at the close of Chapter III that the idea of a human being being in some particular state and in that state producing some particular symbol or word is an appealing one.

Some linguists hold, however, that English grammar is inconsistent with the output of a finite-state machine. Clearly, in trying to understand the structure and the entropy of actual English text we would have to consider such text much more deeply than we have up to this point.

It is safe if not subtle to apply an exact mathematical theory blindly and mechanically to the ideal abstraction for which it holds. We must be clever and wise in using even a good and appropriate mathematical theory in connection with actual, nonideal problems. We should seek a simple and realistic description of the laws governing English text if we are to relate it with communication theory as successfully as possible. Such a description must certainly involve the grammar of the language, which we will discuss in the next chapter. In any event, we know that there are some valid statistics of English text, such as letter and word frequencies, and the coding theorems enable us to take advantage of such known statistics.
If we encode English letter by letter, disregarding the relative frequencies of the letters, we require 4.76 binary digits per character (including space). If we encode letter by letter, taking into account the relative probabilities of various letters, we require 4.03 binary digits per character. If we encode word by word, taking into account relative frequencies of words, we require 1.66 binary digits per character. And, by using an ingenious and appropriate means, Shannon has estimated the entropy of English text to be between .6 and 1.3 bits per letter, so that we may hope for even more efficient encoding.

If, however, we mechanically push some particular procedure for finding the entropy of English text to the limit, we can easily engender not only difficulties but nonsense. Perhaps we can ascribe this nonsense partly to differences between man as a source of English text and our model of an ideal ergodic source, but partly we should ascribe it to the use of an inappropriate approach. We can surely say that the model of man as an ergodic source of text is good and useful if not perfect, and we should regard it highly for these qualities.

This chapter has been long and heavy going, and a summary seems in order. Clearly, it is impossible to recapitulate briefly all those matters which took so many pages to expound. We can only re-emphasize the most vital points.

In communication theory the entropy of a signal source in bits per symbol or per second gives the average number of binary digits, per symbol or per second, necessary to encode the messages produced by the source.

We think of the message source as randomly, that is, unpredictably, choosing one among many possible messages for transmission. Thus, in connection with the message source we think of entropy as a measure of choice, the amount of choice the source exercises in selecting the one particular message that is actually transmitted.

We think of the recipient of the message, prior to the receipt of the message, as being uncertain as to which among the many possible messages the message source will actually generate and transmit to him. Thus, we think of the entropy of the message source as measuring the uncertainty of the recipient as to which message will be received, an uncertainty which is resolved on receipt of the message.

If the message is one among n equally probable symbols or messages, the entropy is log n. This is perfectly natural, for if we have log n binary digits, we can use them to write out

    2^(log n) = n

different binary numbers, and one of these numbers can be used as a label for each of the n messages.

More generally, if the symbols are not equally probable, the entropy is given by equation 5.1. By regarding a very long block of symbols, whose content is little dependent on preceding symbols, as a sort of supersymbol, equation 5.1 can be modified to give the entropy per symbol for information sources in which the probability that a symbol is chosen depends on what symbols have been chosen previously. This gives us equation 5.3. Other general expressions for entropy are given by equations 5.4 and 5.5.

By assuming that the symbols or blocks of symbols which a source produces are encoded by a most efficient binary code called a Huffman code, it is possible to prove that the entropy of an ergodic source measured in bits is equal to the average number of binary digits necessary to encode it.

An error-free communication channel may not transmit binary digits; it may transmit letters or other symbols. We can imagine attaching different message sources to such a channel and seeking (usually mathematically) the message source that causes the entropy of the message transmitted over the channel to be as large as possible. This largest possible entropy of a message transmitted over an error-free channel is called the channel capacity. It can be proved that, if the entropy of a source is less than the channel capacity of the channel, messages from the source can be encoded so that they can be transmitted over the channel. This is Shannon's fundamental theorem for the noiseless channel.

In principle, expressions such as equations 5.1, 5.3, 5.4, and 5.5 enable us to compute the entropy of a message source by statistical analysis of messages produced by the source. Even for an ideal ergodic source, this would often call for impractically long computations. In the case of an actual source, such as English text, some naive prescriptions for computing entropy can be meaningless.

An approximation to the entropy can be obtained by disregarding the effect of some past symbols on the probability of the source producing a particular symbol next. Such an approximation to the entropy is always too large and calls for encoding by means of more binary digits than are absolutely necessary. Thus, if we encode English text letter by letter, disregarding even the relative probabilities of letters, we require 4.76 binary digits per letter, while if we encode word by word, taking into account the relative probability of words, we require 1.66 binary digits per letter.

If we wanted to do even better we would have to take into account other features of English such as the effect of the constraints imposed by grammar on the probability that a message source will produce a particular word.

While we do not know how to encode English text in a highly efficient way, Shannon made an ingenious experiment which shows that the entropy of English text must lie between .6 and 1.3 bits per character. In this experiment a person guessed what letter would follow the letters of a passage of text many letters long.

CHAPTER VI

Language and Meaning

The two great triumphs of information theory are establishing the channel capacity and, in particular, the number of binary digits required to transmit information from a particular source, and showing that a noisy communication channel has an information rate in bits per character or bits per second up to which errorless transmission is possible despite the noise. In each case, the results must be demonstrated for discrete and for continuous sources and channels.

After four chapters of by no means easy preparation, we were finally ready to essay in the previous chapter the problem of the number of binary digits required to transmit the information generated by a truly ergodic discrete source.

Were this book a text on information theory, we would proceed to the next logical step, the noisy discrete channel, and then on to the ergodic continuous channel. At the end of such a logical progress, however, our thoughts would necessarily be drawn back to a consideration of the message sources of the real world, which are only approximately ergodic, and to the estimation of their entropy and the efficient encoding of the messages they produce.

Rather than proceeding further with the strictly mathematical aspects of communication theory at this point, is it not more attractive to pause and consider that chief form of communication,

language, in the light of communication theory? And, in doing so, why should we not let our thoughts stray a little in viewing this important part of our world from the small eminence we have attained? Why should we not see whether even the broad problems of language and meaning seem different to us in the light of what we have learned?

In following such a course the reader should heed a word of caution. So far the main emphasis has been on what we know. What we know is the hard core of science. However, scientists find it very difficult to share the things that they know with laymen. To understand the sure and the reasonably sure knowledge of science takes the sort of hard thought which I am afraid was required of the reader in the last few chapters.

There is, however, another and easier though not entirely frivolous side to science. This is a peculiar type of informed ignorance. The scientist's ignorance is rather different from the layman's ignorance, because the background of established fact and theory on which the scientist bases his peculiar brand of ignorance excludes a wide range of nonsense from his speculations. In the higher and hazier reaches of the scientist's ignorance, we have scientifically informed ignorance about the origin of the universe, the ultimate basis of knowledge, and the relation of our present scientific knowledge to politics, free will, and morality. In this particular chapter we will dabble in what I hope to be scientifically informed ignorance about language.

The warning is, of course, that much of what will be put forward here about language is no more than informed ignorance. The warning seems necessary because it is very hard for laymen to tell scientific ignorance from scientific fact. Because the ignorance is necessarily expressed in broader, sketchier, and less qualified terms than is the fact, it is easier to assimilate. Because it deals with grand and unsolved problems, it is more romantic. Generally, it has a wider currency and is held in higher esteem than is scientific fact.

However hazardous such ignorance may be to the layman, it is valuable to the scientist. It is this vision of unattained lands, of unscaled heights, which rescues him from complacency and spurs him beyond mere plodding. But when the scientist is airing his ignorance he usually knows what he is doing, while the unwarned layman apparently often does not and is left scrambling about on cloud mountains without ever having set foot on the continents of knowledge.

With this caution in mind, let us return to what we have already encountered concerning language and proceed thence.

In what follows we will confine ourselves to a discussion of grammatical English. We all know (and especially those who have had the misfortune of listening to a transcription of a seemingly intelligible conversation or technical talk) that much spoken English appears to be agrammatical, as, indeed, much of Gertrude Stein is. So are many conventions and clichés. "Me heap big chief" is perfectly intelligible anywhere in the country, yet it is certainly not grammatical. Purists do not consider the inverted word order which is so characteristic of second-rate poetry as being grammatical. Thus, a discussion of grammatical English by no means covers the field of spoken and written communication, but it charts a course which we can follow with some sense of order and interest.
We have noted before that, if we are to write what will be accepted as English text, certain constraints must be obeyed. We cannot simply set down any word following any other. A complete grammar of a language would have to express all of these constraints fully. It should allow within its rules the construction of any sequence of English words which will be accepted, at some particular time and according to some particular standard, as grammatical.

The matter of acceptance of constructions as grammatical is a difficult and hazy one. The translators who produced the King James Bible were free to say "fear not," "sin not," and "speak not" as well as "think not," "do not," or "have not," and we frequently repeat the aphorism "want not, waste not." Yet in our everyday speech or writing we would be constrained to say "do not fear," "do not sin," or "do not speak," and we might perhaps say, "if you are not to want, you should not waste." What is grammatical certainly changes with time. Here we can merely notice this and pass on to other matters.

Certainly, a satisfactory grammar must prescribe certain rules which allow the construction of all possible grammatical utterances and of grammatical utterances only. Besides doing this, satisfactory rules of grammar should allow us to analyze a sentence so as to distinguish the features which were determined merely by the rules of grammar from any other features.

If we once had such rules, we would be able to make a new estimate of the entropy of English text, for we could see what part of sentence structure is a mere mechanical following of rules and what part involves choice or uncertainty and hence contributes to entropy. Further, we could transmit English efficiently by transmitting as a message only data concerning the choices exercised in constructing sentences; at the receiver, we could let a grammar machine build grammatical sentences embodying the choices specified by the received message.

Even grammar, of course, is not the whole of language, for a sentence can be very odd even if it is grammatical. We can imagine that, if a machine capable of producing only grammatical sentences made its choices at random, it might perhaps produce such a sentence as "The chartreuse semiquaver skinned the feelings of the manifold." A man presumably makes his choices in some other way if he says, "The blue note flayed the emotions of the multitude." The difference lies in what choices one makes while following grammatical rules, not in the rules themselves. An understanding of grammar would not unlock to us all of the secrets of language, but it would take us a long step forward.

What sort of rules will result in the production of grammatical sentences only and of all grammatical sentences, even when choices are made at random? In Chapter III we saw that English-like sequences of words can be produced by choosing a word at random according to its probability of succeeding a preceding sequence of words some M words long. An example of a second-order word approximation, in which a word is chosen on the basis of its succeeding the previous word, was given.

One can construct higher-order word approximations by using the knowledge of English which is stored in our heads. One can, for instance, obtain a fourth-order word approximation by simply showing a sequence of three connected words to a person and asking him to think up a sentence in which the sequence of words occurs and to add the next word.
By going from person to person a long string of words can be constructed, for instance:

1. When morning broke after an orgy of wild abandon he said her head shook vertically aligned in a sequence of words signifying what.
2. It happened one frosty look of trees waving gracefully against the wall.
3. When cooked asparagus has a delicious flavor suggesting apples.
4. The last time I saw him when he lived.

These "sentences" are as sensible as they are because selections of words were not made at random but by thinking beings. The point to be noted is how astonishingly grammatical the sentences are, despite the fact that rules of grammar (and sense) were applied to only four words at a time (the three shown to each person and the one he added). Still, example 4 is perhaps dubiously grammatical.

If Shannon is right and there is in English text a choice of about 1 bit per symbol, then choosing among a group of 4 words could involve about 22 binary choices, or a choice among some 10 million 4-word combinations. In principle, a computer could be made to add words by using such a list of combinations, but the result would not be assuredly grammatical, nor could we be sure that this cumbersome procedure would produce all possible grammatical sequences of words. There probably are sequences of words which could form a part of a grammatical sentence in one case and could not in another case. If we included such a sequence, we would produce some nongrammatical sentences, and, if we excluded it, we would fail to produce all grammatical sentences. If we go to combinations of more than four words, we will favor grammar over completeness. If we go to fewer than four words, we will favor completeness over grammar. We can't have both.

The idea of a finite-state machine recurs at this point. Perhaps at each point in a sentence a sentence-producing machine should be in a particular state, which allows it certain choices as to what state it will go to next. Moreover, perhaps such a machine can deal with certain classes or subclasses of words, such as singular nouns, plural nouns, adjectives, adverbs, verbs of various tense and number, and so on, so as to produce grammatical structures into which words can be fitted rather than sequences of particular words.

The idea of grammar as a finite-state machine is particularly appealing because a mechanist would assert that man must be a finite-state machine, because he consists of only a finite number of cells, or of atoms if we push the matter further.

Noam Chomsky, a brilliant and highly regarded modern linguist, rejects the finite-state machine as either a possible or a proper model of grammatical structure. Chomsky points out that there are many rules for constructing sequences of characters which cannot be embodied in a finite-state machine. For instance, the rule might be, choose letters at random and write them down until the letter Z shows up, then repeat all the letters since the preceding Z in reverse order, and then go on with a new set of letters, and so on. This process will produce a sequence of letters showing clear evidence of long-range order. Further, there is no limit to the possible length of the sequence between Z's. No finite-state machine can simulate this process and this result.
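The letter-reversal rule is stated precisely enough to be written out as a procedure. A minimal sketch follows; the point to notice is that the letters gathered since the last Z must all be remembered, and there is no bound on how many there may be, which is just the memory no finite-state machine can supply.

    import random
    import string

    def reversal_source(num_segments):
        """Choose letters at random until Z appears, then repeat the letters
        since the preceding Z in reverse order, and go on with a new set."""
        output = []
        for _ in range(num_segments):
            segment = []
            while True:
                letter = random.choice(string.ascii_uppercase)
                output.append(letter)
                if letter == "Z":
                    break
                segment.append(letter)      # memory that can grow without limit
            output.extend(reversed(segment))
        return "".join(output)

    print(reversal_source(3))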
Chomsky points out that there is no limit to the possible length of grammatical sentences in English and argues that English sentences are organized in such a way that this is sufficient to rule out a finite-state machine as a source of all possible English text. But, can we really regard a sentence miles long as grammatical when we know darned well that no one ever has or will produce such a sentence and that no one could understand it if it existed?

To decide such a question, we must have a standard of being or not being grammatical. While Chomsky seems to refer being grammatical, as well as some questions of punctuation and meaning, to spoken English, I think that his real criterion is: a sentence is grammatical if, in reading or saying it aloud with a natural expression and thoughtfully but ingenuously, it is deemed grammatical by a person who speaks it, or perhaps by a person who hears it. Some problems which might plague others may not bother Chomsky, because he speaks remarkably well-connected and grammatical English.

Whether or not the rules of grammar can be embodied in a finite-state machine, Chomsky offers persuasive evidence that it is wrong and cumbersome to try to generate a sentence by basing the choice of the next word entirely and solely on words already written down. Rather, Chomsky considers the course of sentence generation to be something of this sort:

We start with one or another of several general forms the sentence might take; for example, a noun phrase followed by a verb phrase. Chomsky calls such a particular form of sentence a kernel sentence. We then invoke rules for expanding each of the parts of the kernel sentence. In the case of a noun phrase we may first describe it as an article plus a noun and finally as "the man." In the case of a verb phrase we may describe it as a verb plus an object, the object as an article plus a noun, and, in choosing particular words, as "hit the ball." Proceeding in this way from the kernel sentence, noun phrase plus verb phrase, we arrive at the sentence, "The man hit the ball." At any stage we could have made other choices. By making other choices at the final stages we might have arrived at "A girl caught a cat."

Here we see that the element of choice is not exercised sequentially along the sentence from beginning to end. Rather, we choose an over-all skeletal plan or scheme for the whole final sentence at the start. That scheme or plan is the kernel sentence. Once the kernel sentence has been chosen, we pass on to parts of the kernel sentence. From each part we proceed to the constituent elements of that part and from the constituent elements to the choice of particular words. At each branch of this treelike structure growing from the kernel sentence, we exercise choice in arriving at the particular final sentence, and, of course, we chose the kernel sentence to start with.
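The expansion of a kernel sentence into particular words can be imitated with a handful of rewrite rules. The little Python sketch below is only an illustration of the optional, choice-allowing rules; the rule set and word lists are invented for the example and are not Chomsky's, and the obligatory rules mentioned below are not represented.

```python
import random

# A toy phrase-structure grammar: each symbol on the left may be rewritten
# as one of the alternatives on the right.  Lower-case strings are words.
rules = {
    "SENTENCE":    [["NOUN-PHRASE", "VERB-PHRASE"]],   # the kernel form
    "NOUN-PHRASE": [["ARTICLE", "NOUN"]],
    "VERB-PHRASE": [["VERB", "NOUN-PHRASE"]],
    "ARTICLE":     [["the"], ["a"]],
    "NOUN":        [["man"], ["ball"], ["girl"], ["cat"]],
    "VERB":        [["hit"], ["caught"]],
}

def expand(symbol):
    """Expand a symbol by choosing one alternative and expanding its parts."""
    if symbol not in rules:                   # a word: nothing left to expand
        return [symbol]
    choice = random.choice(rules[symbol])     # exercise choice at this branch
    words = []
    for part in choice:
        words.extend(expand(part))
    return words

print(" ".join(expand("SENTENCE")))           # e.g. "the man hit the ball"
```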
Here I have indicated Chomsky's ideas very incompletely and very sketchily. For instance, in dealing with irregular forms of words Chomsky will first indicate the root word and its particular grammatical form, and then he will apply certain obligatory rules in arriving at the correct English form. Thus, in the branching construction of a sentence, use is made both of optional rules, which allow choice, and of purely mechanical, deterministic obligatory rules, which do not. To understand this approach further and to judge its merit, one must refer to Chomsky's book,¹ and to the references he gives.

¹ Noam Chomsky, Syntactic Structures, Mouton and Co., 's-Gravenhage, 1957.

Chomsky must, of course, deal with the problem of ambiguous sentences, such as, "The lady scientist made the robot fast while she ate." The author of this sentence, a learned information theorist, tells me that, allowing for the vernacular, it has at least four different meanings. It is perhaps too complicated to serve as an example for detailed analysis.

We might think that ambiguity arises only when one or more words can assume different meanings in what is essentially the same grammatical structure. This is the case in "he was mad" (either angry or insane) or "the pilot was high" (in the sky or in his cups). Chomsky, however, gives a simple example of a phrase in which the confusion is clearly grammatical. In "the shooting of the hunters," the hunters may be either the subject, as in "the growling of lions," or the object, as in "the growing of flowers."

Chomsky points out that different rules of transformation applied to different kernel sentences can lead to the same sequence of grammatical elements. Thus, "the picture was painted by a real artist" and "the picture was painted by a new technique" seem to correspond grammatically word for word, yet the first sentence could have arisen as a transformation of "a real artist painted the picture," while the second could not have arisen as a transformation of a sentence having this form. When the final words as well as the final grammatical elements are the same, the sentence is ambiguous.

Chomsky also faces the problem that the distinction between the provinces of grammar and meaning is not clear. Shall we say that grammar allows adjectives but not adverbs to modify nouns? This allows "colorless green." Or should grammar forbid the association of some adjectives with some nouns, of some nouns with some verbs, and so on? With one choice, certain constructions are grammatical but meaningless; with the other they are ungrammatical.

We see that Chomsky has laid out a plan for a grammar of English which involves at each point in the synthesis of a sentence certain steps which are either obligatory or optional. The processes allowed by this grammar cannot be carried out by a finite-state machine, but they can be carried out by a more general machine called a Turing machine, which is a finite-state machine plus an infinitely long tape on which symbols can be written and from which symbols can be read or erased. The relation of Chomsky's grammar to such machines is a proper study for those interested in automata.

We should note, however, that if we arbitrarily impose some bound on the length of a sentence, even if we limit the length to 1,000 or 1 million words, then Chomsky's grammar does correspond to a finite-state machine. The imposition of such a limit on sentence length seems very reasonable in a practical way.

Once a general specification or model of a grammar of the sort Chomsky proposes is set up, we may ask under what circumstances and how can an entropy be derived which will measure the choice or uncertainty of a message source that produces text according to the rules of the grammar? This is a question for the mathematically skilled information theorist.

Much more important is the production of a plausible and workable grammar. This might be a phrase-structure grammar, as Chomsky proposes, or it might take some other form. Such a grammar might be incomplete in that it failed to produce or analyze some constructions to be found in grammatical English. It seems more important that its operation should correspond to what we know of the production of English by human beings. Further, it should be simple enough to allow the generation and analysis of text by means of an electronic computer. I believe that computers must be used in attacking problems of the structure and statistics of English text.
While a great many people are convinced that Chomsky's phrase-structure approach is a very important aspect of grammar, some feel that his picture of the generation of sentences should be modified or narrowed if it is to be used to describe the actual generation of sentences by human beings. Subjectively, in speaking or listening to a speaker one has a strong impression that sentences are generated largely from beginning to end. One also gets the impression that the person generating a sentence doesn't have a very elaborate pattern in his head at any one time but that he elaborates the pattern as he goes along.

I suspect that studies of the form of grammars and of the statistics of their use as revealed by language will in the not distant future tell us many new things about the nature of language and about the nature of men as well. But, to say something more particular than this, I would have to outreach present knowledge, mine and others'.

A grammar must specify not only rules for putting different types of words together to make grammatical structures; it must divide the actual words of English into classes on the basis of the places in which they can appear in grammatical structures. Linguists make such a division purely on the basis of grammatical function without invoking any idea of meaning. Thus, all we can expect of a grammar is the generation of grammatical sentences, and this includes the example given earlier: "The chartreuse semiquaver skinned the feelings of the manifold." Certainly this division of words into grammatical categories such as nouns, adjectives, and verbs is not our sole guide concerning the use of words in producing English text.

What does influence the choice among words when the words used in constructing grammatical sentences are chosen, not at random by a machine, but rather by a live human being who, through long training, speaks or writes English according to the rules of the grammar? This question is not to be answered by a vague appeal to the word meaning only. The criteria used in producing English sentences can be very complicated indeed. Philosophers and psychologists have speculated about and studied the use of words and language for generations, and it is as hard to say anything entirely new about this as it is to say anything entirely true. In particular, what Bishop Berkeley wrote in the eighteenth century concerning the use of language is so sensible that one can scarcely make a reasonable comment without owing him credit.

Let us suppose that a poet of the scanning, rhyming school sets out to write a grammatical poem. Much of his choice will be exercised in selecting words which fit into the chosen rhythmic pattern, which rhyme, and which have alliteration and certain consistent or agreeable sound values. This is particularly notable in Poe's "The Bells," "Ulalume," and "The Raven."

Further, the poet will wish to bring together words which through their sound as well as their sense arouse related emotions or impressions in the reader or hearer. The different sections of Poe's "The Bells" illustrate this admirably. There is a marked contrast between:

How they tinkle, tinkle, tinkle,
In the icy air of night!
While the stars that oversprinkle
All the heavens, seem to twinkle
In a crystalline delight; . . .

and

Through the balmy air of night
How they ring out their delight!
From the molten-golden notes,
And all in tune,
What a liquid ditty floats . . .

Sometimes the picture may be harmonious, congruous, and moving without even the trivial literal meaning of this verse of Poe's, as in Blake's two lines:

Tyger, Tyger, burning bright
In the forests of the night . . .

In instances other than poetry, words may be chosen for euphony, but they are perhaps more often chosen for their associations with and ability to excite passions such as those listed by Berkeley: fear, love, hatred, admiration, disdain. Particular words or expressions move each of us to such feelings. In a given culture, certain words and phrases will have a strong and common effect on the majority of hearers, just as the sights, sounds or events with which they are associated do. The words of a hymn or psalm can induce a strong religious emotion; political or racial epithets, a sense of alarm or contempt; and the words and phrases of dirty jokes, sexual excitement.

One emotion which Berkeley does not mention is a sense of understanding. By mouthing commonplace and familiar patterns of words in connection with ill-understood matters, we can associate some of our emotions of familiarity and insight with our perplexity about history, life, the nature of knowledge, consciousness, death, and Providence. Perhaps such philosophy as makes use of common words should be considered in terms of the assertion of a reassurance concerning the importance of man's feelings rather than in terms of meaning.

One could spend days on end examining examples of motivation in the choice of words, but we do continually get back to the matter of meaning. Whatever meaning may be, all else seems lost without it. A Chinese poem, hymn, deprecation, or joke will have little effect on me unless I understand Chinese in whatever sense those who know a language understand it.

Though Colin Cherry, a well-known information theorist, appears to object, I think that it is fair to regard meaningful language as a sort of code of communication. It certainly isn't a simple code in which one mechanically substitutes a word for a deed. It's more like those elaborate codes of early cryptography, in which many alternative code words were listed for each common letter or word (in order to suppress frequencies). But in language, the listings may overlap. And one person's code book may have different entries from another's, which is sure to cause confusion.

If we regard language as an imperfect code of communication, we must ultimately refer meaning back to the intent of the user. It is for this reason that I ask, "What do you mean?" even when I have heard your words. Scholars seek the intent of authors long dead, and the Supreme Court seeks to establish the intent of Congress in applying the letter of the law.

Further, if I become convinced that a man is lying, I interpret his words as meaning that he intends to flatter or deceive me. If I find that a sentence has been produced by a computer, I interpret it to mean that the computer is functioning very cleverly.

I don't think that such matters are quibbles; it seems that we are driven to such considerations in connection with meaning if we do regard language as an imperfect code of communication, and as one which is sometimes exploited in devious ways. We are certainly far from any adequate treatment of such problems.

Grammatical sentences do, however, have what might be called a formal meaning, regardless of intent. If we had a satisfactory grammar, a machine should be able to establish the relations between the words of a sentence, indicating subject, verb, object, and what modifying phrases or clauses apply to what other words.
The next problem beyond this in seeking such formal meaning in sentences is the problem of associating words with objects, qualities, actions, or relations in the world about us, including the world of man's society and of his organized knowledge.

In the simple communications of everyday life, we don't have much trouble in associating the words that are used with the proper objects, qualities, actions, and relations. No one has trouble with "close the east window" or "Henry is dead," when he hears such a simple sentence in simple, unambiguous surroundings. In a familiar American room, anyone can point out the window; we have closed windows repeatedly, and we know what direction east is. Also, we know Henry (if we don't get Henry Smith mixed up with Henry Jones), and we have seen dead people. If the sentence is misheard or misunderstood, a second try is almost sure to succeed.

Think, however, how puzzling the sentence about the window would be, even in translation, to a shelterless savage. And we can get pretty puzzled ourselves concerning such a question as, is a virus living or dead?

It appears that much of the confusion and puzzlement about the associations of words with things of the world arose through an effort by philosophers from Plato to Locke to give meaning to such ideas as window, cat, or dead by associating them with general ideas or ideal examples. Thus, we are presumed to identify a window by its resemblance to a general idea of a window, to an ideal window, in fact, and a cat by its resemblance to an ideal cat which embodies all the attributes of cattiness. As Berkeley points out, the abstract idea of a (or the ideal) triangle must at once be "neither oblique, rectangle, equilateral, equicrural nor scalenon, but all and none of these at once."

Actually, when a doctor pronounces a man dead he does so on the basis of certain observed signs which he would be at a loss to identify in a virus. Further, when a doctor makes a diagnosis, he does not start out by making an over-all comparison of the patient's condition with an ideal picture of a disease. He first looks for such signs as appearance, temperature, pulse, lesions of the skin, inflammation of the throat, and so on, and he also notes such symptoms as the patient can describe to him. Particular combinations of signs and symptoms indicate certain diseases, and in differential diagnoses further tests may be used to distinguish among diseases producing similar signs and symptoms.

In a similar manner, a botanist identifies a plant, familiar or unfamiliar, by the presence or absence of certain qualities of size, color, leaf shape and disposition, and so on. Some of these qualities, such as the distinction between the leaves of monocotyledonous and dicotyledonous plants, can be decisive; others, such as size, can be merely indicative. In the end, one is either sure he is right or perhaps willing to believe that he is right; or the plant may be a new species.

Thus, in the workaday worlds of medicine and botany, the ideal disease or plant is conspicuous by its absence as any actual and useful criterion. Instead, we have lists of qualities, some decisive and some merely indicative.
The value of this observation has been confirmed strongly in recent work toward enabling machines to carry out tasks of recognition or classification. Early workers, perhaps misled partly by philosophers, conceived the idea of matching a letter to an ideal pattern of a letter or the spectrogram of a sound to an ideal spectrogram of the sound. The results were terrible. Audrey, a pattern-matching machine with the bulk of a hippo and brains beneath contempt, could recognize digits spoken by one voice or a selected group of voices, but Audrey was sadly fallible. We should, I think, conclude that human recognition works this way in very simple cases only, if at all.

Later and more sophisticated workers in the field of recognition look for significant features. Thus, as a very simple example, rather than having an ideal pattern of a capital Q, one might describe Q as a closed curve without corners or reversals of curvature and with something attached between four and six o'clock.

L. D. Harmon built at the Bell Laboratories a simple device weighing a few pounds which almost infallibly recognizes the digits from one to zero written out as words in longhand. Does this gadget match the handwriting against patterns? You bet it doesn't! Instead, it asks such questions as, how many times did the stylus go above or below certain lines? Were T's crossed or I's dotted?

Certainly, no one doubts that words refer to classes of objects, actions, and so on. We are surrounded by and involved with a large number of classes and subclasses of objects and actions which we can usefully associate with words. These include such objects as plants (peas, sunflowers . . .), animals (cats, dogs . . .), machines (autos . . .), buildings (houses, towers . . .), clothing (skirts, socks . . .), and so on. They include such very complicated sequences of actions as dressing and undressing (the absent-minded, including myself, repeatedly demonstrate that they can do this unconsciously); tying one's shoes (an act which children have considerable difficulty in learning), eating, driving a car, reading, writing, adding figures, playing golf or tennis (activities involving a host of distinct subsidiary skills), listening to music, making love, and so on and on and on.

It seems to me that what delimits a particular class of objects, qualities, actions, or relations is not some sort of ideal example. Rather, it is a list of qualities. Further, the list of qualities cannot be expected to enable us to divide experience up into a set of logical, sharply delimited, and all-embracing categories. The language of science may approach this in dealing with a narrow range of experience, but the language of everyday life makes arbitrary, overlapping, and less than all-inclusive divisions of experience. Yet, I believe that it is by means of such lists of qualities that we identify doors, windows, cats, dogs, men, monkeys, and other objects of daily life. I feel also that this is the way in which we identify common actions such as running, skipping, jumping, and tying, and such symbols as words, written and spoken, as well. I think that it is only through such an approach that we can hope to make a machine classify objects and experience in terms of language, or recognize and interpret language in terms of other language or of action. Further, I believe that when a word cannot offer a table of qualities or signs whose elements can be traced back to common and familiar experiences, we have a right to be wary of the word.

If we are to understand language in such a way that we can hope someday to make a machine which will use language successfully, we must have a grammar and we must have a way of relating words to the world about us, but this is of course not enough. If we are to regard sentences as meaningful, they must in some way correspond to life as we live it.

Our lives do not present fresh objects and fresh actions each day. They are made up of familiar objects and familiar though complicated sequences of actions presented in different groupings and orders. Sometimes we learn by adding new objects, or actions, or combinations of objects or sequences of actions to our stock, and so we enrich or change our lives. Sometimes we forget objects and actions.

Our particular actions depend on the objects and events about us. We dodge a car (a complicated sequence of actions). When thirsty, we stop at the fountain and drink (another complicated but recurrent sequence). In a packed crowd we may shoulder someone out of the way as we have done before. But our information about the world does not all come from direct observation, and our influence on others is happily not confined to pushing and shoving. We have a powerful tool for such purposes: language and words. We use words to learn about relations among objects and activities and to remember them, to instruct others or to receive instruction from them, to influence people in one way or another. For the words to be useful, the hearer must understand them in the same sense that the speaker means them, that is, insofar as he associates them with nearly enough the same objects or skills. It's no use, however, to tell a man to read or to add a column of figures if he has never carried out these actions before, so that he doesn't have these skills.
It is no use to tell him to shoot the aardvark and not the gnu if he has never seen either.

Further, for the sequences of words to be useful, they must refer to real or possible sequences of events. It's of no use to advise a man to walk from London to New York in the forenoon immediately after having eaten a seven o'clock dinner.

Thus, in some way the meaningfulness of language depends not only on grammatical order and on a workable way of associating words with collections of objects, qualities, and so on; it also depends on the structure of the world around us. Here we encounter a real and an extremely serious difficulty with the idea that we can in some way translate sentences from one language into another and accurately preserve the "meaning."

One obvious difficulty in trying to do this arises from differences in classification. We can refer to either the foot or the lower leg. The Russians have one word for the foot plus the lower leg. Hungarians have twenty fingers (or toes), for the word is the same for either appendage. To most of us today, a dog is a dog, male or female, but men of an earlier era distinguished sharply between a dog and a bitch. Eskimos make, it is said, many distinctions among snow which in our language would call for descriptions, and for us even these descriptions would have little real content of importance or feeling, because in our lives the distinctions have not been important. Thus, the parts of the world which are common and meaningful to those speaking different languages are often divided into somewhat different classes. It may be impossible to write down in different languages words or simple sentences that specify exactly the same range of experience.

There is a graver problem than this, however. The range of experience to which various words refer is not common among all cultures. What is one to do when faced with the problem of translating a novel containing the phrase, "tying one's shoelace," which as we have noted describes a complicated action, into the language of a shoeless people? An elaborate description wouldn't call up the right thing at all. Perhaps some cultural equivalent (?) could be found. And how should one deal with the fact that "he built a house" means personal tree cutting and adzing in a pioneer novel, while it refers to the employment of an architect and a contractor in a contemporary story?

It is possible to make some sort of translation between closely related languages on a word-for-word or at least phrase-for-phrase basis, though this is said to have led from "out of sight, out of mind" to "blind idiot." When the languages and cultures differ in major respects, the translator has to think what the words mean in terms of objects, actions, or emotions and then express this meaning in the other language. It may be, of course, that the culture with which the language is associated has no close equivalents to the objects or actions described in the passage to be translated. Then the translator is really stuck.

How, oh how is the man who sets out to build a translating machine to cope with a problem such as this? He certainly cannot do so without in some way enabling the machine to deal effectively with what we refer to as understanding. In fact, we see understanding at work even in situations which do not involve translation from one language into another. A screen writer who can quite accurately transfer the essentials of a scene involving a dying uncle in Omsk to one involving a dying father in Dubuque will repeatedly
A screenwriter who can quite accuratelytransfer the essentialsof a sceneinvolvirg a dying uncle in Omsk to one involving a dying father in Dubuque will repeatedly |24 Symbols,Signals and Noise technical make compretenonsense in trying t-orephrase a simple but not statement. This is clearly becausehe understandsgrief science. now Having grappledpainfully with the word meaning,we are two sides. faced with the word undersianding.This seemsto have manip,tl?- If we understandalgebra or calculus,we can use their or to supply tions to solveproble,irs we haven't encounteredbefore under- proofs of theoremswe haven't seenproved. In this sense' merelyto standing is manifestedby u power to do, to create,not -To proves repeat. some degree, ut electronic computer which be- theorems in mathe^iti"ul which it has not encountered pe+ap.sbe fore (as computerscan be piogrammed to do) could sideto said to understand the subject. But there is an emotional ways understanding,too. when we can prove a theoremin several manners' and fit it togeiherwith other theoremsor factsin various it all fits when we can view a field from many aspectsand seehow wg together, we say that we understato the subjectdeepty. lttul with it. of a warm and confident feeling about our ability to cope have felt the warmth course, &t one time or anothet most of us we were at without manifesting the ability. And how disillusioned the critical test! In discussinglanguage from the point of view of information the imper- theory, we havi drided-utotrg a ticle of words, through of fectly charted channelsof g*-*ar and on into the obscurities can *.urring and understanding.This showsus how far ignorance theory,or take one. It would be absuid to assertthat information linguistics, anythirg else,has enabledus to solvethe probleml of At best' we of meurring,of understanding,of philosophy, of life' mechani- can perhai, ,uy that we are pushing a little beyond the of choice cal constraintj of languageand geiting at the amount the use that languageaffords.itrii idea s,rggestsviews concerning them. The and functio-n of language, but it does not establish these reader may share *"y f,eely offered ignorance concerning matterr, oi he may prefer his own sort of ignorance' CHAPTERVII Effirient Encoding

WE WILL NEVER AGAIN understand nature as well as Greek philosophers did. A general explanation of common phenomena in terms of a few all-embracing principles no longer satisfies us. We know too much. We must explain many things of which the Greeks were unaware. And, we require that our theories harmonize in detail with the very wide range of phenomena which they seek to explain. We insist that they provide us with useful guidance rather than with rationalizations. The glory of Newtonian mechanics is that it has enabled men to predict the positions of planets and satellites and to understand many other natural phenomena as well; it is surely not that Newtonian mechanics once inspired and supported a simple mechanistic view of the universe at large, including life.

Present-day physicists are gratified by the conviction that all (non-nuclear) physical, chemical, and biological properties of matter can in principle be completely and precisely explained in all their detail by known quantum laws, assuming only the existence of electrons and of atomic nuclei of various masses and charges. It is somewhat embarrassing, however, that the only physical system all of whose properties actually have been calculated exactly is the isolated hydrogen atom.

Physicists are able to predict and explain some other physical phenomena quite accurately and many more semiquantitatively. However, a basic and accurate theoretical treatment, founded on electrons, nuclei, and quantum laws only, without recourse to other experimental data, is lacking for most common thermal, mechanical, electrical, magnetic, and chemical phenomena. Tracing complicated biological phenomena directly back to quantum first principles seems so difficult as to be scarcely relevant to the real problems of biology. It is almost as if we knew the axioms of an important field of mathematics but could prove only a few simple theorems.

Thus, we are surrounded in our world by a host of intriguing problems and phenomena which we cannot hope to relate through one universal theory, however true that theory may be in principle. Until recently the problems of science which we commonly associate with the field of physics have seemed to many to be the most interesting of all the aspects of nature which still puzzle us. Today, it is hard to find problems more exciting than those of biochemistry and physiology.

I believe, however, that many of the problems raised by recent advances in our technology are as challenging as any that face us. What could be more exciting than to explore the potentialities of electronic computers in proving theorems or in simulating other behavior we have always thought of as "human"? The problems raised by electrical communication are just as challenging. Accurate measurements made by electrical means have revolutionized physical acoustics. Studies carried out in connection with telephone transmission have inaugurated a new era in the study of speech and hearing, in which previously accepted ideas of physiology, phonetics, and linguistics have proved to be inadequate. And, it is this chaotic and intriguing field of much new ignorance and of a little new knowledge to which communication theory most directly applies.

If communication theory, like Newton's laws of motion, is to be taken seriously, it must give us useful guidance in connection with problems of communication. It must demonstrate that it has a real and enduring substance of understanding and power. As the name implies, this substance should be sought in the efficient and accurate transmission of information. The substance indeed exists. As we have seen, it existed in an incompletely understood form even before Shannon's work unified it and made it intelligible.

To deal with the matter of accurate transmission of information we need new basic understanding, and this matter will be tackled in the next chapter. The foregoing chapters have, however, put us in a position to discuss some challenging aspects of the efficient transmission of information.

We have seen that in the entropy of an information source, measured in bits per symbol or per second, we have a measure of the number of binary digits, of off-or-on pulses, per symbol or per second which are necessary to transmit a message. Knowing this number of binary digits required for encoding and transmission, we naturally want a means of actually encoding messages with, at the most, not many more binary digits than this minimum number.

Novices in mathematics, science, or engineering are forever demanding infallible, universal, mechanical methods for solving problems. Such methods are valuable in proving that problems can be solved, but in the case of difficult problems they are seldom practical, and they may sometimes be completely unfeasible. As an example, we may note that an explicit solution of the general cubic equation exists, but no one ever uses it in a practical problem. Instead, some approximate method suited to the type or class of cubics actually to be solved is resorted to.
The person who isn't a novice thinks hard about a specific problem in order to see if there isn't some better approach than a machine-like application of what he has been taught. Let us see how this applies in the case of information theory. We will first consider the case of a discrete source which produces a string of symbols or characters.

In Chapter V we saw that the entropy of a source can be computed by examining the relative probabilities of occurrence of various long blocks of characters. As the length of the block is increased, the approximation to the entropy gets closer and closer. In a particular case, perhaps blocks 5, or 10, or 100 characters in length might be required to give a very good approximation to the entropy.

We also saw that by dividing the message into successive blocks of characters, to each of which a probability of occurrence can be attached, and by encoding these blocks into binary digits by means of the Huffman code, the number of digits used per character approaches the entropy as the blocks of characters are made longer and longer.

Here indeed is our foolproof mechanical scheme. Why don't we simply use it in all cases? To see one reason, let us examine a very simple case. Suppose that an information source produces a binary digit, a 1 or a 0, randomly and with equal probability and then follows it with the same digit twice again before producing independently another digit. The message produced by such a source might be:

0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1

Would anyone be foolish enough to divide such a message successively into blocks of 1, 2, 3, 4, 5, etc., characters, compute the probabilities of the blocks, encode them with a Huffman code, and note the improvement in the number of binary digits required for transmission? I don't know; it sometimes seems to me that there are no limits to human folly.

Clearly, a much simpler procedure is not only adequate but absolutely perfect. Because of the repetition, the entropy is clearly the same as for a succession of a third as many binary digits chosen randomly and independently with equal probability of 1 or 0. That is, it is 1/3 binary digit per character of the repetitious message. And, we can transmit the message perfectly efficiently simply by sending every third character and letting the recipient write down each received character three times.
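Stated as a procedure, the perfectly efficient scheme is just this; the little Python sketch below merely restates the two steps described above, using the sample message of this example.

```python
def compress(message):
    """Keep every third character of a message in which each digit
    appears three times in succession."""
    return message[::3]

def expand(compressed):
    """Write each received character down three times."""
    return "".join(ch * 3 for ch in compressed)

source = "000111000111111000000111"
sent = compress(source)          # "01011001" -- one third as many digits
print(sent)
print(expand(sent) == source)    # True: the message is recovered exactly
```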
This example is simple but important. It illustrates the fact that we should look for natural structure in a message source, for salient features of which we can take advantage.

The discussion of English text in Chapter IV illustrates this. We might, for instance, transmit text merely as a picture by television or facsimile. This would take many binary digits per character. We would be providing a transmission system capable of sending not only English text, but Cyrillic, Greek, Sanskrit, Chinese, and other text, and pictures of landscapes, storms, earthquakes, and Marilyn Monroe as well. We would not be taking advantage of the elementary and all-important fact that English text is made up of letters.

If we encode English text letter by letter, taking no account of the different probabilities of various letters (and excluding the space), we need 4.7 binary digits per letter. If we take into account the relative probabilities of letters, as Morse did, we need 4.14 binary digits per letter. If we proceeded mechanically to encode English text more efficiently, we might go on to encoding pairs of letters, sequences of three letters, and so on. This, however, would provide for encoding many sequences of letters which aren't English words. It seems much more sensible to go on to the next larger unit of English text, the word. We have seen in Chapter IV that we would expect to use only about 9 binary digits per word or 1.7 binary digits per character in so encoding English text.

If we want to proceed further, the next logical step would be to consider the structure of phrases or sentences; that is, to take advantage of the rules of grammar. The trouble is that we don't know the rules of grammar completely enough to help us, and if we did, a communication system which made use of these rules would probably be impractically complicated. Indeed, in practical cases it still seems best to encode the letters of English text independently, using at least 5 binary digits per character.

It is, however, important to get some idea of what could be accomplished in transmitting English text. To this end, Shannon considered the following communication situation. Suppose we ask a man, using all his knowledge of English, to guess what the next character in some English text is. If he is right we tell him so, and he writes the character down. If he is wrong, we may either tell him what the character actually is or let him make further guesses until he guesses the right character.

Now, suppose that we regard this process as taking place at the transmitter, and say that we have an absolutely identical twin to guess for us at the receiver, a twin who makes just the same mistakes that the man at the transmitter does. Then, to transmit the text, we let the man at the receiver guess. When the man at the transmitter guesses right, so will the man at the receiver. Thus, we need send information to the man at the receiver only when the man at the transmitter guesses wrong and then only enough information to enable the men at the transmitter and the receiver to write down the right character.
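We can imitate the scheme with any deterministic predictor shared by transmitter and receiver in place of the identical twins. In the Python sketch below the particular prediction rule (guess the character that has most often followed the current one so far) is an invented stand-in, chosen only so that the example runs; the point is simply that whatever the shared rule guesses correctly need not be transmitted.

```python
def predict(history):
    """A stand-in for the identical twins: any deterministic rule will do,
    so long as transmitter and receiver use exactly the same one.  Here we
    guess the character that has most often followed the current character
    so far, or a space if we have nothing to go on."""
    if not history:
        return " "
    last = history[-1]
    followers = [history[i + 1] for i in range(len(history) - 1) if history[i] == last]
    return max(set(followers), key=followers.count) if followers else " "

def encode(text):
    """Send a correction only where the shared predictor guesses wrong."""
    corrections = []
    for i, ch in enumerate(text):
        guess = predict(text[:i])
        corrections.append(None if guess == ch else ch)   # None means "right"
    return corrections

def decode(corrections):
    out = ""
    for c in corrections:
        out += predict(out) if c is None else c
    return out

text = "the man hit the ball and the man ran"
reduced = encode(text)
print(decode(reduced) == text)                                    # True
print(sum(c is not None for c in reduced), "of", len(text), "characters corrected")
```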
The reasonls that, by modern standardsof elec- trical communication, it takes very few binary digits or off-or-on pulses to send English text. we have to hurry to speak a few hundred words a minute, yet it is easy to send over a thousand words of text over a telephoneconnection in a minute or to send l0 million words a minuie over a TV channel,and, in principleif not in practice,we could transmit some50,000 words a minuteover

[Figure VII-1. Shannon's scheme for transmitting text by prediction: a predictor and comparison at the transmitter turn the original text into "reduced text," and an identical predictor and comparison at the receiver reconstruct the original text.]

As a matter of fact, in practical cases we have even retreated from Morse's ingenious code, which sends an E faster than a Z. A teletype system uses the same length of signal for any letter.

Efficient encoding is thus potentially more important for voice transmission than for transmission of text, for voice takes more binary digits per word than does text. Further, efficient encoding is potentially more important for TV than for voice.

Now, a voice or a TV signal is inherently continuous, as opposed to English text, numbers, or binary digits, which are discrete. Disregarding capitalization and punctuation, an English character may be any one of the letters or the space. At a given moment, the sound wave of the human voice may have any pressure at all lying within some range of pressures. We have noted in Chapter IV that if the frequencies of such a continuous signal are limited to some band width B, the signal can be accurately represented by 2B samples or measurements of amplitude per second.

We remember, however, that the entropy per character depends on how many values the character can assume. Since a continuous signal can assume an infinite number of different values at a sample point, we are led to assume that a continuous signal must have an entropy of an infinite number of bits per sample.

This would be true if we required an absolutely accurate reproduction of the continuous signal. However, signals are transmitted to be heard or seen. Only a certain degree of fidelity of reproduction is required. Thus, in dealing with the samples which specify continuous signals, Shannon introduces a fidelity criterion. To reproduce the signal in a way meeting the fidelity criterion requires only a finite number of binary digits per sample or per second, and hence we can say that, within the accuracy imposed by a particular fidelity criterion, the entropy of a continuous source has a particular value in bits per sample or bits per second.

It is extremely important to realize that the fidelity criterion should be associated with long stretches of the signal, not with individual samples. For instance, in transmitting a sound, if we make each sample 10 per cent larger, we will merely make the sound louder, and no damage will be done to its quality. If we make a random error of 10 per cent in each sample, the recovered signal will be very noisy. Similarly, in picture transmission an error in brightness or contrast which changes smoothly and gradually across the picture will pass unnoticed, but an equal but random error differing from point to point will be intolerable.

We have seen that we can send a continuous signal by quantizing each sample, that is, by allowing it to assume only certain preassigned values. It appears that 128 values are sufficient for the transmission of telephone-quality speech or of pictures.

We must realize, however, that, in quantizing a speech signal or a picture signal sample by sample, we are proceeding in a very unsophisticated manner, just as we are if we encode text letter by letter rather than word by word. The name hyperquantization has been given to the quantization of continuous signals of more than one sample at a time. This is undoubtedly the true road to efficient encoding of continuous signals. One can easily ruin his chances of efficient encoding completely by quantizing the samples at the start.
Yet, to hyperquantize a continuous signal is not easy. Samples are quantized independently in present pulse code modulation systems that carry telephone conversations from telephone office to telephone office and from town to town, and in the digital switching systems that provide much long distance switching. Samples are quantized independently in sending pictures back from Mars, Jupiter and farther planets.

In pulse code modulation, the nearest of one of a number of standard levels or amplitudes is assigned to each sample. As an example, if eight levels were used, they might be equally spaced as in a of Figure VII-2. The level representing the sample is then transmitted by sending the binary number written to the right of it.

Some subtlety of encoding can be used even in such a system. Instead of the equally spaced amplitudes of Figure VII-2a, we can use quantization levels which are close together for small signals and farther apart for large signals, as shown in Figure VII-2b. The reason for doing this is, of course, that our ears are sensitive to a fractional error in signal amplitude rather than to an error of so many dynes below or above average pressure, or so many volts positive or negative, in the signal. By such companding (compressing the high amplitudes at the transmitter and expanding them again at the receiver), 7 binary digits per sample can give a signal almost as good as 11 binary digits would if the signal levels transmitted were separated by equal differences in amplitude.
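The effect of companding is easy to demonstrate numerically. In the Python sketch below a logarithmic compression curve is used; the particular curve, the eight levels, and the constant that sets the curve's strength are illustrative choices only, not a description of any actual telephone system. Two small samples that fall into the same step of the uniform quantizer are told apart once they are compressed first.

```python
import math

LEVELS = 8       # eight quantization levels, as in the example of Figure VII-2
MU = 255.0       # strength of the compression curve (an arbitrary choice here)

def quantize(x):
    """Uniform quantization: map a sample in -1..+1 to the nearest of
    LEVELS equally spaced levels and return its binary code."""
    step = 2.0 / LEVELS
    index = min(LEVELS - 1, int((x + 1.0) / step))
    return format(index, "03b")

def compress(x):
    """A logarithmic compressor: small amplitudes are spread out, large
    amplitudes are squeezed together, before uniform quantization."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y):
    """The inverse curve, applied at the receiver."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

quiet_a, quiet_b = 0.02, 0.20        # two samples from a quiet passage
print(quantize(quiet_a), quantize(quiet_b))                      # same code: 100 100
print(quantize(compress(quiet_a)), quantize(compress(quiet_b)))  # different: 101 110
```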

[Figure VII-2. Quantization levels for pulse code modulation, each labeled with a three-digit binary number from 000 to 111: (a) eight equally spaced levels; (b) levels spaced close together near zero amplitude and farther apart for large amplitudes.]

To send speech more efficiently than this, we need to examine the characteristics both of speech and of hearing. After all, we require only enough accuracy of transmission to convince the hearer that transmission is good enough.

Efficiency is not everything. A vocoder can transmit only one voice, not two or more at a time. Also, vocoders behave badly when one speaks in the presence of loud noise. Trying to transmit the actual speech waveform more efficiently, or waveform coding, avoids these problems, but 15,000-20,000 binary digits per second are required for acceptable speech.

Figure VII-3 shows the wave forms of several speech sounds, that is, how the pressure of the sound wave or the voltage representing it in a communication system varies with time. We see that many of the wave forms, and especially those for the vowels (a through d), repeat over and over almost exactly. Couldn't we perhaps transmit just one complete period of variation and use it to replace several succeeding periods?

[Figure VII-3. Wave forms of several speech sounds, with sound pressure plotted against time in seconds.]

This is very difficult, for it is hard for a machine to determine just how long a period is in actual speech. It has been tried. The speech reproduced is intelligible but seriously distorted.

If speech is to be encoded efficiently, a much more fundamental approach is required. We must know how great a variety of speech sounds must be transmitted and how effective our sense of hearing is in distinguishing among speech sounds.

The fluctuations of air pressure which constitute the sounds of speech are very rapid indeed, of the order of thousands per second. Our voluntary control over our vocal tracts is exercised at a much lower rate. At the most, we change the manner of production of sounds a few tens of times a second. Thus, speech may well be (and is) simpler than we might conclude by examining the rapidly fluctuating sound waves of speech.

What control do we exercise over our vocal organs? First of all, we control the production of voiced sounds by our control over our vocal cords. These are two lips or folds of muscular tissue attached to a cartilaginous box called the larynx, which is prominent in man as the Adam's apple. When we are not giving voice to sound, these are wide open. They can be drawn together more or less tightly, so that when air from the lungs is forced through them they emit a sound something like a Bronx cheer. If they are held very tight, the sound has a high pitch; if they are more relaxed, the sound has a lower pitch.

The pulses of air passing the vocal cords contain many frequencies. The mouth and lips act as a complex resonator which emphasizes certain frequencies more than others. What frequencies are emphasized depends on how much and at what position the tongue is raised or humped in the mouth, on whether the soft palate opens the nasal cavities to the mouth and throat, and on the opening of the jaws and the position of the lips.

Particular sounds of voiced speech, which includes vowels and other continuants, such as m and r, are formed by exciting the vocal cords and giving particular characteristic shapes to the mouth. Stop consonants, or plosives, such as p, b, g, t, are formed by stopping off the vocal passage at various points with the tongue or lips, creating an air pressure, and suddenly releasing it. The vocal
At the synthesrzer,if the sound is voiceless,a hissingnoise is produced;lf the soundis voiced a sequenceof electricalpulses is produced at the proper rate, correspondingto the puffs of air Lpassing the vocal cords of the speaker. The hiss or pulses are fed to an affay of filters, each pas.singa band of frequ^enciescorresponding to a particular filter in the analyzer.The amount of sound passitg through a particular filter in the synthesizeris controlled by the output of the corresponding analyz.t ntt.r so as to be the sameas that which the analyzerfilter indiiates to be presentin the voice in that frequet c-y.range. This processresults in the reproduction of intelligible speech. In effect, the analyzerlistens to and analyzesspeech, and then instructs the synth-esizer,which is an artificial speakingmachine, how to say th;words all over againwith the very pitch and accent of the speaker. Most vocodershave a strong and unpleasantelectrical accent. The study of this has ted to new and important ideas concerning what determinesand influencesspeech quality; we cannot afford time to go into this matter here. Even imperfect vocoderscan be Efficient Encoding r37

t UJ

uIU tpt I,U ta E.

E. TU z a E, N o J" a trru

lq llltrtlttlrl 9< FYZOE[Hg E'Ua u6L,- wA UZ *2sb oo ,E F:Bg Pd oo2r \ u6 I i I I s I ,:b ffi83 t\ d:;E

a UJ a U L oS u lllltilillrl ZW

Ir IS HARDTo puT oNESELFin the place of another, and, especially, it is hard to put oneselfin the place of a person of an earlier duy. What would a Victorian have thought of present-day dress?Were Newton's laws of motion and of gravitation as aston- ishing and disturbing to his contemporariesas Einstein's theory of relativity appearsto have been to his? And what is disturbing about relativity? Present-daystudents accept it, not only without a murmur, but with a feeling of inevitability, as if any other idea must be very odd, surprising,and inexplicable. Partly, this is becauseour attitudes are bred of our times and surroundings.Partly, in the caseof scienceat least, it is because ideas come into being as a responseto new or better-phrased questions"We rememberthat accordirg to Plato, Socratesdrew a geometrical proof from a slave simply by means of an ingenious sequenceof questions.Those who have not seriouslyasked them- selvesa pafticular questionare not likely to have come upon the proper answer,and, sometimes,when the questionis phrasedwith the answerin mind, the answerappears to be obvious. Those interestedin communication have been aware from the very beginnitg that communication circuits or channels are im- perfect. In telephonyand radio, we hear the desired signal against a background of noise,which may be strong or faint and which may vary in quality from the cracklirg of static to a steadyhiss.

t45 r46 Symbols, Signals and f'{oise In TV, the picture is overlaid faintly or strongly with an ever- 'osnow." changing granular In teletypewriter transmission,the receivedcharacter may occasionallydiffer from that transmitted' Supposethat one had questioned a communication engineer have aboui this generalproblem of "noise" in 1945.one might ..What have asked, can one do about noise?"The engineermight .,you the answered, can increasethe transmitter power or make to receiverless noisy. And be sure that the receiveris insensitive disturbanceswith frequenciesother than the signaf frequel:t..:." o'Can't One might have peisisted, one do anything else?"The 'owell engine., ,iight hav^.urrr*ered, by using frequencymody- latlon, which takes a very large band width, one can reducethe effect of noise." Suppose,howevero that one had asked,"In teletypewritersys- tems, noisemay causesome receivedcharacters to be wrong;how can one guardagainst this?" The engineercould and might perhaps have answered,r'Ikrrow that if I usefive off-or-on pulsesto repre- sent a decimal digit and assign to the decimal digits only such when sequencesas all hive two ons and three offsoI can often tell an error has been made in transmissioo,for when errors are made the receivedsequence may have other than 2 ons." One might havepursued the matter further with, "If the teletype- writer cirduit doescause errors is there any way that one can get have the correct message to the destination?"The engineermight .,I but answered, suppor. you can if you repeatit enough times, that's very wasteful.You'd better fix the circuit." Here we are getting pretty closeto questionsthat just hadn't been askedbefoie Shtnnon askedthem. Nonetheless,let us go on and imagine that one had said, "S,rpposethat I told you that by propert/encoding my message,I can send it over even a noisy Lttutttteiwith u compfetely r,egtigiblefraction of errors' a fraction smaller than any assignuUt.uatni. Supposethat I told you that, if the sort of noisein t[e channel is known and if its magnitudeis known, I can calculatejust how many charactersI can sendover the channelper second'andthat, if I sendany number fewerthan this, I can do so virtually without error, while if I try to sendmore' I will be bound to make errors." The engineermight well have answered,"You'd surehave to The Noisy Channel t47

show me. I never thought of things in quite that way before, but what you say seemsextremely improbable. Why, every time the noise increases, the error rate incteases. Of course, repeating a messageseveral times does work better when there aren't too many errors. But, it is always very costly. Maybe there's something in what you sa/, but I'd be awfully surprised if there was. Stiil,1he way you put it . . ." Whatever we may imagine concerning an engineer benighted in the days of error, mathematicians and engineers who have survived the transition all feel that Shannon's results concerning the trans- mission of information over a noisy channel were and still are very surprising. Yet I have known an intelligent layman to see nothing remarkable in Shannon's results. What is one to think ofthis? Perhaps the best course is merely to describe and explain the problem of the noisy channel as we now understand it, raisirg and answering questions that, however natural and inevitable they now seem, belong in their trend and content to the post-shannon era. The reader can be suqprisedor not as he chooses. So far we have discussedboth simple and complex means for encoditg text and numbers for efficient transmission. We have noted further that any electrical signal of limited band width W can be represented by 2W amplitudes or samples per second, measured or taken at intervals | /2W secondsapart. We have seen that, by means of pulse code modulation, we can use some num- ber, aroun d 7, of binary digits to represent adequately the ampli- tude of any sample. Thus, by using pulse code modulation or some more complicated and more efficient scheme, we can transmit spee_c-hor picture signals by means of a sequence of binary digits or off-or-on or positive-or-negative pulses of current. All of this works perfectly if'the recipient of the messagereceives the same signal that the sender traqsmits. The actual facts are dif- ferent. Sometimes he receives a d when a I is transmitted, and sometimes he receivesa I when a 0 is transmitted. This can hup- p:tt through the malfunction of electrical relays in a slow-speeO telegraph circuit or through the malfunction of vacuum tubes or transistors in a higher speed circuit. It can also happen because of interfering signals or noise, either noise from man-made apparatus, or noise from magnetic storms. 148 Symbols, Signals and Noise We can easi$ seein a simple casehow elrors can occur because we want to of the admixture of noise with a signal. Imagine that overa wire send a largenumber of binary digits, 0 or 1,per second by meansof an electrical signal.we may represent:* signal99i- VIII-I, veying thesedigits by the successionof sampl-ess of Figure eachofwhichwillbe+1or_l.HerewehaveaSuccessionof 1 0 1 1 positive and negativevoltages which representthe digits 10010. either Now supposea random noise voltage, which may be this positive or negative,is added to the signal.we can represent simul- also by a ,rrr-lrr of noise samplesn ofrigure vIII-l taken noise is taneously with the signal samples.The iignal p-lus the is shown obtain.o uy adding th; signal ano rhe noise samplesand ass +ninFigureVIII-I. mes- If we interpret a positive signal-plus-noisein the received the received sageas a I and a negativesignit-plus-noise as a 0, then

[Figure VIII-1: the signal samples s, the noise samples n, their sum s + n, and the received digits r, plotted against sample position; the samples received in error are marked with an X.]
the received message will be represented by the digits r of Figure VIII-1. Thus, errors in transmission, as indicated, occur in positions 2, 3, and 7.

The effect of such errors in transmission can range from annoying to dangerous. In speech or picture transmission by means of simple coding schemes, they result in clicks, hissing noises, or "snow." If more efficient, block encoding schemes are used (hyperquantization), the effects of errors will be more pronounced. In general, however, we may expect the most dangerous effects of errors in the transmission of text.

In the transmission of English text by conventional means, errors merely put a wrong letter in here and there. The text is so redundant that we catch such errors by eye. However, when type is set remotely by teletypewriter signals, as it is, for instance, in the simultaneous printing of news magazines in several parts of the country, even errors of this sort can be costly.

When numbers are sent, errors are much more serious. An error might change $1,000 into $9,000. If the error occurred in a program intended to make an electronic computer carry out a complicated calculation, the error could easily cause the whole calculation to be meaningless.

Further, we have seen that, if we encode English text or any other signal very efficiently, so as largely to remove the redundancy, an error can cause a gross change in the meaning of the received signal.

When errors are very important to us, how indeed may we guard against them? One way would be to send every letter twice or to send every binary digit used in transmitting a letter or a number twice. Thus, in transmitting the binary sequence 1 0 1 0 0 1 1 0 1, we might send and receive as follows:

sent      11 00 11 00 00 11 11 00 11
received  11 00 11 00 01 11 11 00 11
                        X (error)

For a given rate of sending binary digits, this will cut our rate of transmitting information in half, for we have to pause and retransmit every digit. However, we can now see from the received signal that an error has occurred at the marked point, because instead of a pair of like digits, 0 0 or 1 1, we have received a pair of unlike digits, 0 1. We don't know whether the correct, transmitted pair was 0 0 or 1 1. We have detected the error, but we have not corrected it.

If errors aren't too frequent, that is, if the chance of two errors occurring in the transmission of three successive digits is negligible, we can correct as well as detect an error by transmitting each digit three times, as follows:

sent      111 000 111 000 000 111 111
received  111 000 101 000 000 111 111
                   X (error)

We have now cut our rate of transmission to one-third, because we have to pause and retransmit each digit twice. However, we can now correct the error indicated by the fact that the digits in the indicated group 1 0 1 are not all the same. If we assume that there was only one error in the transmission of this group of digits, then the transmitted group must have been 1 1 1, representing 1, rather than 0 0 0, representing 0.

We see that a very simple scheme of repeating transmitted digits can detect or even correct infrequent errors of transmission. But how costly it is! If we use this means of error correction or detection, even when almost all of the transmitted digits are correct we have to cut our rate of transmission in half by repeating digits in order just to detect errors, and we have to cut our rate of transmission to one-third by transmitting each digit three times in order to get error correction. Moreover, these schemes won't work if errors are frequent enough so that more than one will sometimes occur in the transmission of two or three digits.

Clearly, this simple approach will never lead to a sound understanding of the possibility of error correction. What is required is a deep and powerful mathematical attack. This is just what Shannon provided in discovering and proving the fundamental theorem for the noisy channel. It is the course of his reasoning that we are about to follow.
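Before following that reasoning, the repeat-three-times scheme just described is easy to try out. The following sketch in the Python programming language is my own illustration, not part of the original argument; the error probability of 0.01 and the message of 10,000 digits are arbitrary choices. Each message digit is sent three times over a channel that flips digits at random, and the receiver decodes by majority vote.

    import random

    def encode_repeat3(bits):
        # Send each message digit three times.
        return [b for b in bits for _ in range(3)]

    def channel(bits, p_error):
        # Each transmitted digit is independently received in error
        # with probability p_error.
        return [b ^ 1 if random.random() < p_error else b for b in bits]

    def decode_repeat3(received):
        # Majority vote within each group of three received digits.
        decoded = []
        for i in range(0, len(received), 3):
            group = received[i:i + 3]
            decoded.append(1 if sum(group) >= 2 else 0)
        return decoded

    random.seed(1)
    message = [random.randint(0, 1) for _ in range(10000)]
    received = channel(encode_repeat3(message), p_error=0.01)
    decoded = decode_repeat3(received)
    errors = sum(m != d for m, d in zip(message, decoded))
    print(errors, "errors in", len(message), "message digits")
    # typically only a few errors remain out of 10,000

The cost is just what the text says: three transmitted digits for every message digit, and the scheme fails whenever two errors happen to fall in the same group of three.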

In formulating an abstract and general model of noise or errors, we will deal with the case of a discrete communication system which transmits some group of characters, such as the digits from 0 to 9 or the letters of the alphabet. For convenience, let us consider a system for transmitting the digits 0 through 9. This is illustrated in Figure VIII-2. At the left we have a number of little circles labeled with the digits; we may regard these little circles as push buttons. To the right we have a number of little circles, again labeled with the digits. We may regard these as lights. When we push a digit button at the transmitter to the left, some digit light lights up at the receiver to the right. If our communication system were noiseless, pushing the 0 button would always light the 0 light, pushing the 1 button would always light the 1 light, and so on. However, in an imperfect or noisy communication system, pushing the 4 button, for instance, may light the 0 light, or the 1 light, or the 2 light, or any other light, as shown by the lines radiating from the 4 button in Figure VIII-2.

[Figure VIII-2: push buttons for the digits 0 through 9 at the transmitter and lights for the digits 0 through 9 at the receiver; lines radiating from the 4 button show that a transmitted 4 may light various lights, and arrows running from the 6 light back to various buttons show the recipient's uncertainty as to which button was pressed.]

In a simple, noisy communication system, we can say that when we press a button the light which lights is a matter of chance, that the light which lights when the 4 button is pressed is independent of what has gone before, and that, if the 4 button is pressed, there is some probability p4(6) that the 6 light will light, and so on.

If the sender can't be sure which light will light when he presses a particular button, then the recipient of the message can't be sure which button was pressed when a particular light lights. This is indicated by the arrows from light 6 to various buttons on the left. If, for instance, light 6 lights, there is some probability p6(4) that button 4 was pressed, and so on. Only for a noiseless system will p6(6) be unity and p6(4), p6(3), etc., be zero.

The diagram of Figure VIII-2 would be too complicated if all possible arrows were put in, and the number of probabilities is too great to list, but I believe that the general idea of the degree and nature of the uncertainty of the character received when the sender tries to send a particular character, and the uncertainty of the character sent when the recipient receives a particular character, have been illustrated. Let us now consider this noisy communication channel in a rather general way. In doing so we will represent by x all of the characters sent and by y all of the characters received. The characters x are just the characters generated by the message source from which the message comes. If there are m of these characters and if they occur independently with probabilities p(x), then we know from Chapter V that the entropy H(x) of the message source, the rate at which the message source generates information, must be

H(x) = - Σ p(x) log p(x)    (8.1)

the sum being taken over all m characters x. We can regard the output of the device, which we designate by y, as another message source. The number of lights need not be equal to the number of buttons, but we will assume that it is, so that there are m lights. The entropy of the output will be

H(y) = - Σ p(y) log p(y)    (8.2)

We note that while H(x) depends only on the input to the communication channel, H(y) depends both on the input to the channel and on the errors made in transmission. Thus, the probability of receiving a 4 if nothing but a 4 is ever sent is different from the probability of receiving a 4 if transmitting buttons are pressed at random.

If we imagine that we can see both the transmitter and the receiver, we can observe how often certain combinations of x and y occur; say, how often 4 is sent and 6 is received. Or, knowing the statistics of the message source and the statistics of the noisy channel, we can compute such probabilities. From these we can compute another entropy,

H(x,y) = - Σ Σ p(x,y) log p(x,y)    (8.3)

the double sum being taken over all combinations of a character x sent and a character y received. This is the uncertainty of the combination of x and y.

Further, we can, say, suppose that we know x (that is, we know what key was pressed). What are the probabilities of various lights lighting (as illustrated by the arrows to the right in Figure VIII-2)? This leads to an entropy

Hx(y) = - Σ Σ p(x) px(y) log px(y)    (8.4)

Here px(y) is the probability that y is received when x is sent. This is a conditional entropy or uncertainty. Its form is reminiscent of the entropy of a finite-state machine. As in that case, we multiply the uncertainty for a given condition (state, value of x) by the probability that that condition (state, value of x) will occur and sum over all conditions (states, values of x).

Finally, suppose we know what light lights. We can say what the probabilities are that various buttons were pressed. This leads to another conditional entropy,

Hy(x) = - Σ Σ p(y) py(x) log py(x)    (8.5)

Here py(x) is the probability that x was sent when y is received. This is the sum over y of the probability that y is received times the uncertainty that x is sent when y is received.
These conditional entropies depend on the statistics of the message source, because they depend on how often x is transmitted or how often y is received, as well as on the errors made in transmission.

The entropies listed above are best interpreted as uncertainties involving the characters generated by the message source and the characters received by the recipient. Thus:

H(x) is the uncertainty as to x, that is, as to which character will be transmitted.

H(y) is the uncertainty as to which character will be received in the case of a given message source and a given communication channel.

H(x,y) is the uncertainty as to when x will be transmitted and y received.

Hx(y) is the uncertainty of receiving y when x is transmitted. It is the average uncertainty of the sender as to what will be received.

Hy(x) is the uncertainty that x was transmitted when y is received. It is the average uncertainty of the recipient of the message as to what was actually sent.

There are relations among these quantities:

H(x,y) = H(x) + Hx(y)    (8.6)

That is, the uncertainty of sending x and receiving y is the uncertainty of sending x plus the uncertainty of receiving y when x is sent.

H(x,y) = H(y) + Hy(x)    (8.7)

That is, the uncertainty of receiving y and sending x is the uncertainty of receiving y plus the uncertainty that x was sent when y was received.

We see that when Hx(y) is zero, Hy(x) must be zero, and H(y) is then just H(x). This is the case of the noiseless channel, for which the entropy of the received signal is just the same as the entropy of the transmitted signal. The sender knows just what will be received, and the recipient of the message knows just what was sent.

The uncertainty as to which symbol was transmitted when a given symbol is received, that is, Hy(x), seems a natural measure of the information lost in transmission. Indeed, this proves to be the case, and the quantity Hy(x) has been given a special name; it is called the equivocation of the communication channel.

If we take H(x) and Hy(x) as entropies in bits per second, the rate R of transmission of information over the channel can be shown to be, in bits per second,

R = H(x) - Hy(x)    (8.8)

That is, the rate of transmission of information is the source rate or entropy less the equivocation. It is the entropy of the message as sent less the uncertainty of the recipient as to what message was sent. The rate is also given by

R = H(y) - Hx(y)    (8.9)

That is, the rate is the entropy of the received signal y less the uncertainty that y was received when x was sent. It is the entropy of the message as received less the sender's uncertainty as to what will be received. The rate is also given by

R = H(x) + H(y) - H(x,y)    (8.10)

The rate is the entropy of x plus the entropy of y less the uncertainty of occurrence of the combination x and y. We will note from 8.3 that for a noiseless channel, since p(x,y) is zero except when x = y, H(x,y) = H(x) = H(y). The information rate is then just the entropy of the information source, H(x).

Shannon makes expression 8.8 for the rate plausible by means of the sketch shown in Figure VIII-3. Here we assume a system in which an observer compares transmitted and received signals and then sends correction data by means of which the erroneous received signal is corrected. Shannon is able to show that in order to correct the message, the entropy of the correction signal must be equal to the equivocation.

We see that the rate R of relation 8.8 depends both on the channel and on the message source. How can we describe the capacity of a noisy or imperfect channel for transmitting information? We can choose the message source so as to make the rate R as large as possible for a given channel. This maximum possible rate of transmission for the channel is called the channel capacity C.
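The three expressions 8.8, 8.9, and 8.10 necessarily agree, and it may help to see them agree on a concrete case. The short Python sketch below is my own illustration; the table of joint probabilities p(x,y) is an assumed one, describing a source with p(0) = p(1) = 1/2 sent over a channel that delivers each digit correctly nine times out of ten.

    from math import log2

    # Assumed joint probabilities p(x, y), for illustration only.
    p_xy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

    p_x = {x: sum(v for (a, _), v in p_xy.items() if a == x) for x in (0, 1)}
    p_y = {y: sum(v for (_, b), v in p_xy.items() if b == y) for y in (0, 1)}

    def H(dist):
        # Entropy of a probability distribution, in bits.
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    H_x, H_y, H_xy = H(p_x), H(p_y), H(p_xy)
    Hx_y = H_xy - H_x   # Hx(y), using relation 8.6
    Hy_x = H_xy - H_y   # Hy(x), the equivocation, using relation 8.7

    print(H_x - Hy_x)          # expression 8.8: source entropy less equivocation
    print(H_y - Hx_y)          # expression 8.9
    print(H_x + H_y - H_xy)    # expression 8.10; all three give about 0.53 bits

With these assumed probabilities the rate comes out at about 0.53 bits per transmitted digit, a figure we will meet again below for a channel that makes one error in ten.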

[Figure VIII-3: an observer compares the transmitted and received signals and sends correction data to a correcting device at the receiver, which repairs the erroneous received signal.]

Shannon's fundamental theorem for a noisy channel involves the channel capacity C. It says:

Let a discrete channel have a capacity C and a discrete source the entropy per second H. If H ≤ C there exists a coding system such that the output of the source can be transmitted over the channel with an arbitrarily small frequency of errors (or an arbitrarily small equivocation). If H > C it is possible to encode the source so that the equivocation is less than H - C + ε, where ε is arbitrarily small. There is no method of encoding which gives an equivocation less than H - C.

This is a precise statement of the result which so astonished become engine"r, urd mathematicians.As errors in transmission channel *J* probable, that is, as they occur more frequentlyl the instance' capacity u,,defined by Shannon goesdown' For -glldually error, the if our systemtransmits binaty digits and if some are in we can chann"i ,upacity c, that is, n-umberof bits of information channel send per ui"uty digit transmitted, decreases.But the of digits capacity d..reases giaduatly as the errors in transmission few errors become more freqGnt. To achieve transmissionwith as of trans- as we may care to specify, we have to reduce our rate mission ,o that it is equal-to or less than the channel capaclty. in effi- How are we to urhiru, this result? we remember that to ciently encoding an information source, it is necessaty 9*p a 1o."g -urry characterstogether and so to encode the messaEe a noisy block of charactersui utime. In making very efficient useof received channel, it is also necessary to deal *ittt sequencesof The Noisy Channel r57 characters,each many characterslong. Among such blocks, only certaintransmitted and receivedsequences of characterswill o.r.r, with otherthan a vanishingprobability. In proving the fundamentaltheorem for a noisy channel,Shan- non finds the averagefrequency of error for all possiblecodes (for all associationsof particularinput blocksof characterswith partic- ular output blocks of characters),when the codes arechosen at random, and he then shows that when the channel capacity is greaterthan the entropyof the source,the error rate averagedover all of theseencodirg schemesgoes to zero as the block length is made very long. If we get this good a result by averagirg over all codeschosen at randoffi,then theremust be someone of the codes which givesthis good a result. One information theorist has char- acterrzedthis mode of proof as weird. It is certainly not the sort of attack that would occur to an uninspired mathematician.The problem isn't one which would have occurred to an uninspired mathematician,either. The foregoitg work is entirely general,and hence it appliesto all problems.I think it is illuminating, however, to return to the example of the binary channel with errors, which we discussed early in this chapterand which is illustrated in Figure VIII- l, and see what Shannon'stheorem has to say about this simple and commoncase. Supposethat the probability that over this noisy channela 0 will be receivedas a 0 is equal to the probabtlity p that a I will be receivedas a l. Then the probability that a I will be receivedas a 0 or a 0 as a I must be (l - p).Supposefurther that theseprob- abilities do not dependon past history and do not changewith time. Then, the proper abstractrepresentation of this situationis a symmetricbinary channel(in the manner of Figure VIII-2) as shown in Figure VIII-4. Becauseof the symmetryof this channel,the maximum infor- mation rate, that is, the channel capacity,will be attained for a messagesource such that the probability of sending a I is equal to the probability of sending a zero.Thus, in the caseof x (and, becausethe channelis symmetrical,in the caseof y also)

p(1) = p(0) = 1/2

We already know that under these circumstances

H(x) = H(y) = -(1/2 log 1/2 + 1/2 log 1/2) = 1 bit per symbol

What about the conditional probabilities? What about the equivocation, for instance, as given by 8.5? Four terms will contribute to this conditional entropy. The sources and contributions are:

The probability that 1 is received is 1/2. When 1 is received, the probability that 1 was sent is p and the probability that 0 was sent is (1 - p). The contribution to the equivocation from these events is:

1/2 (-p log p - (1 - p) log (1 - p))

There is a probability of 1/2 that 0 is received. When 0 is received, the probability that 0 was sent is p and the probability that 1 was sent is (1 - p). The contribution to the equivocation from these events is:

1/2 (-p log p - (1 - p) log (1 - p))

Accordingly, we see that, for the symmetrical binary channel, the equivocation, the sum of these terms, is

Hy(x) = -p log p - (1 - p) log (1 - p)

Thus the channel capacity C of the symmetrical binary channel is, from 8.8,

C = 1 + p log p + (1 - p) log (1 - p)
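This expression is easy to evaluate. The short Python computation below is my own check, not part of the original argument; the values for p = 0.9 and p = 0.99 are the ones referred to just below.

    from math import log2

    def capacity(p):
        # Capacity of the symmetric binary channel, in bits per digit:
        # C = 1 + p log p + (1 - p) log (1 - p), logarithms to the base 2.
        if p in (0.0, 1.0):
            return 1.0
        return 1.0 + p * log2(p) + (1.0 - p) * log2(1.0 - p)

    for p in (0.5, 0.9, 0.99, 1.0):
        print(p, round(capacity(p), 2))
    # 0.5 -> 0.0, 0.9 -> 0.53, 0.99 -> 0.92, 1.0 -> 1.0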

[Figure VIII-4: the symmetric binary channel. A transmitted 1 is received as a 1, and a transmitted 0 as a 0, with probability p; a transmitted digit is received as the other digit with probability 1 - p.]

We should note that this channel capacity C is just unity less the function plotted against p in Figure V-1. We see that if p is 1/2, the channel capacity is 0. This is natural, for in this case, if we receive a 1, it is equally likely that a 1 or a 0 was transmitted, and the received message does nothing to resolve our uncertainty as to what digit the sender sent. We should also note that the channel capacity is the same for p = 0 as for p = 1. If we consistently receive a 0 when we transmit a 1 and a 1 when we transmit a 0, we are just as sure of the sender's intentions as if we always get a 1 for a 1 and a 0 for a 0.

If, on the average, 1 digit in 10 is in error, the channel capacity is reduced to .53 of its value for errorless transmission, and for one error in 100 digits, the channel capacity is reduced merely to .92.

The writer would like to testify at this point that the simplicity of the result we have obtained for the symmetrical binary channel is in a sense misleading (it was misleading to the writer at least). The expression for the optimum rate (channel capacity) of an unsymmetrical binary channel, in which the probability that a 1 is received as a 1 is p and the probability that a 0 is received as a 0 is a different number q, is a mess, and more complicated channels must offer almost intractable problems.

Perhaps for this reason, as well as for its practical importance, much consideration has been given to transmission over the symmetrical binary channel. What sort of codes are we to use in order to attain errorless transmission over such a channel? Examples devised by R. Hamming were mentioned by Shannon in his original paper. Later, Marcel J. E. Golay published on error-correcting codes in 1949, and Hamming published his work in 1950. We should note that these codes were devised subsequent to Shannon's work. They might, I suppose, have been devised before, but it was only when Shannon showed error-free transmission to be possible that people asked, "How can we achieve it?"

We have noted that to get an efficient correction of errors, the encoder must deal with a long sequence of message digits. As a simple example, suppose we encode our message digits in blocks of 16 and add after each block a sequence of check digits which enable us to detect a single error in any one of the digits, message digits or check digits. As a particular example, consider the sequence of message digits 1 1 0 1 0 0 1 1 0 1 0 1 1 0 0 0. To find the appropriate check digits, we write the 0's and 1's constituting the message digits in the 4 by 4 grid shown in Figure VIII-5. Associated with each row and each column is a circle. In each circle is a 0 or a 1 chosen so as to make the total number of 1's in the column or row (including the circle as well as the squares) even. Such added digits are called check digits. For the particular assortment of message digits used as an example, together with the appropriately chosen check digits, the numbers of 1's in successive columns (left to right) are 2, 2, 2, 4, all being even numbers, and the numbers of 1's in successive rows (top to bottom) are 4, 2, 2, 2, which are again all even.

What happens if a single error is made in the transmission of a message digit among the 16? There will be an odd number of ones in a row and in a column. This tells us to change the message digit where the row and column intersect. What happens if a single error is made in a check digit? In this case there will be an odd number of ones in a row or in a column.
We have detected an error, but we see that it was not among the message digits.

The total number of digits transmitted for 16 message digits is 16 + 8, or 24; we have increased the number of digits needed in the ratio 24/16, or 1.5. If we had started out with 400 message digits, we would have needed 40 check digits, and we would have increased the number of digits needed only in the ratio of 440/400, or 1.1.
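The row-and-column check scheme just described is easy to try out. The Python sketch below is an illustrative implementation of my own (the function names are invented for the example); it computes the eight check digits for the sixteen message digits used above and locates a single error by the row and column whose parities come out odd.

    def make_checks(grid):
        # One check digit per row and per column, chosen so that the
        # number of 1's in that row or column (check included) is even.
        row_checks = [sum(row) % 2 for row in grid]
        col_checks = [sum(col) % 2 for col in zip(*grid)]
        return row_checks, col_checks

    def locate_error(grid, row_checks, col_checks):
        # A single error in a message digit makes exactly one row and one
        # column parity odd; the error lies where that row and column meet.
        bad_rows = [i for i, row in enumerate(grid)
                    if (sum(row) + row_checks[i]) % 2 == 1]
        bad_cols = [j for j, col in enumerate(zip(*grid))
                    if (sum(col) + col_checks[j]) % 2 == 1]
        if bad_rows and bad_cols:
            return bad_rows[0], bad_cols[0]
        return None   # no error, or an error confined to a check digit

    # The sixteen message digits of the example, written as a 4 by 4 grid.
    grid = [[1, 1, 0, 1],
            [0, 0, 1, 1],
            [0, 1, 0, 1],
            [1, 0, 0, 0]]
    row_checks, col_checks = make_checks(grid)   # rows 1,0,0,1; columns 0,0,1,1

    grid[2][3] ^= 1                               # flip one message digit in transit
    print(locate_error(grid, row_checks, col_checks))   # prints (2, 3)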

[Figure VIII-5: the sixteen message digits written in a 4 by 4 grid, with a circled check digit beside each row and below each column, chosen to make the number of 1's in that row or column even.]

Of course, we would have been able to correct only one error in 440 rather than one error in 24.

Codes can be devised which can be used to correct larger numbers of errors in a block of transmitted characters. Of course, more check digits are needed to correct more errors. A final code, however we may devise it, will consist of some set of 2^M blocks of 0's and 1's representing all of the blocks of digits M digits long which we wish to transmit. If the code were not error correcting, we could use a block just M digits long to represent each block of M digits which we wish to transmit. We will need more digits per block because of the error-correcting feature.

When we receive a given block of digits, we must be able to deduce from it which block was sent despite some number n of errors in transmission (changes of 0 to 1 or 1 to 0). A mathematician would say that this is possible if the distance between any two blocks of the code is at least 2n + 1.

Here distance is used in a queer sense indeed, as defined by the mathematician for his particular purpose. In this sense, the distance between two sequences of binary digits is the number of 0's or 1's that must be changed in order to convert one sequence into the other. For instance, the distance between 0 0 1 0 and 1 1 1 1 is 3, because we can convert one sequence into the other only by changing three digits in one sequence or in the other.

When we make n errors in transmission, the block of digits we receive is a distance n from the code word we sent. It may be a distance n digits closer to some other code word. If we want to be sure that the received block will always be nearer to the correct code word, the one that was sent, than to any other code word, then the distance from any code word to any other code word must be at least 2n + 1.

Thus, one problem of block coding is to find 2^M equal-length code words (longer than M binary digits) that are all at least a distance 2n + 1 from one another. The code words must be as short as possible. The codes of Hamming and Golay are efficient, and other efficient codes have been found.

Another problem of block coding is to provide a feasible scheme for encoding and, especially, for decoding. Simply listing code words won't do. The list would be too long. Encoding blocks of 20 binary digits (M = 20) requires around a million code words. And finding the code word nearest to some received block of digits would take far too long.

Algebraic coding theory provides means for coding and decoding with the correction of many errors. Slepian was a pioneer in this field, and important contributors can be identified by the names of types of algebraic codes: Reed-Solomon codes and Bose-Chaudhuri-Hocquenghem codes provide examples. Elwyn Berlekamp contributed greatly to mathematical techniques for calculating the nearest code word more simply.

Convolutional codes are another means of error correction. In convolutional coding, the latest M digits of the binary stream to be sent are stored in what is called a shift register. Every time a new binary digit comes in, 2 (or 3, or 4) digits are sent out by the coder. The digits sent out are produced by what is called modulo 2 addition of various digits stored in the shift register. (In modulo 2 addition of binary numbers one doesn't "carry.") Convolutional encoding has been traced to early ideas of Elias, but the earliest coding and decoding scheme published is that in a patent of D. W. Hagelbarger, filed in 1958.
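A minimal sketch of the kind of convolutional encoder just described follows. It is my own illustration; the particular register connections are an assumed, commonly used choice rather than Hagelbarger's or any particular standard. For every message digit shifted in, two digits are sent out, each a modulo 2 sum of digits currently in the shift register.

    def convolutional_encode(bits):
        # A rate-1/2 encoder with a three-digit shift register.
        # For each new input digit, two output digits are produced by
        # modulo 2 addition ("exclusive or") of the register contents.
        register = [0, 0, 0]
        out = []
        for b in bits:
            register = [b] + register[:-1]            # shift the new digit in
            out.append(register[0] ^ register[1] ^ register[2])
            out.append(register[0] ^ register[2])
        return out

    message = [1, 0, 1, 1, 0, 0, 1]
    print(convolutional_encode(message))
    # 14 output digits for 7 message digits: the rate is one-half.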
Convolutional decoding really took off in 1967, when Andrew J. Viterbi invented an optimum and simple decoding scheme called maximum likelihood decoding. Today, convolutional decoding is used in such valuable, noisy communication channels as in sending pictures of Jupiter and its satellites back from the Voyager spacecraft. Convolutional coding is particularly valuable in such applications because Viterbi's maximum likelihood decoding can take advantage of the strength as well as the sign of a received pulse. If we receive a very small positive pulse, it is almost as likely to be a negative pulse plus noise as it is to be a positive pulse plus noise. But, if we receive a large positive pulse, it is much likelier to be a positive pulse plus noise than a negative pulse plus noise. Viterbi decoding can take advantage of this.

Block coding is used in protecting the computer storage of vital information. It can also be used in the transmission of binary information over inherently low-noise data circuits.

Many existing circuits that are used to transmit data are subject to long bursts of noise. When this is so, the most effective form of

error correction is to divide the messageup into long blocks of digits and to provide foolproof error detecti,on.If an error is de- tectedin a receivedblock, retransmissionof the block is requested. Mathematiciansare fascinatedby the intricaciesand chattenges of block coding. In the eyesof some,information theory has Ur- come essentillly algebraiccoding theory. Codirg theory is im- portant to informationtheory. But, in its inception, in Shannon's work, informationtheory was, as we haveseen, much broader.And even in coding itself, we must considersource coding as well as channelcoding. In Chapter VII, we discussedways of removing redundancy from a messageso that it could be transmitted by t*anE of fewer binary digits. In this chapter, we have considetld the matter of adding redundancy to a nonredundant messagein order to attain virtually error-freetransmission over a noisy channel. The fact that such error-freetransmission can be attained using a noisy channel was and is surprisitg to communication enginiers and mathe- maticians' but Shannonhas proved that it is necessarilyso. Prior to receiving a messageover an error-free cliannel, the recipient is uncertain as to what particular messageout of many possible messagesthe senderwill actually transmit. The amount of the recipient'suncertainty is the entropy or information rate of the messagesource, measured in bits p.i symbol or per second. Tl: recipient'suncertainty as to what messagethe messagesource will send is completelyresolved if he receivesan exact replica of the messagetransmitted. A messagemay be transmitted by meansof positive and nega- tiu. pulsesof current. If a strong enough noise consistingof rin- dom positive and negative pulsei is added to the signal, iporitive signal pulse may be changedinto a negativepulsei or a negative signal pulse may be changedinto a positive pulse. When Juch a noisy channel is used to transmit the message,if the sendersends any particular lYmbol there is some uncertainty as to what symbol will be receivedby the recipient of the message. When the recipient receivesa message over a noisy channel,he knows what messagehe has received,but he cannot ordinarily be sure what messagewas transmitted. Thus, his uncertainty as to what messagethe senderchose is not completely resolvedeven on 164 Symbols, Signals and Noise the receipt of a message.The remaining uncertainty dependson the probability that a received symbol will be other than the symbol transmitted. From the sender'spoint of view, the uncertaintyof the recipient as to the true messageis the uncertainty,or entropy,of the message sourceplus the uncertaintyof the recipientas to what messagewas transmitted whenhe knows what messagewas received. The measure which Shannonprovides of this latter uncertainty is the equivoca- tion, and he definesthe rate of transmissionof information as the entropy of the messagesource less the equivocation. The rate of transmissionof information dependsboth on the amount of noiseor uncertaintyin the channeland on what message source is connectedto the channel at the transmitting end. Let us supposethat we choosea messagesource such that this tate of transmissionwhich we have deflned is as great as it is possibleto make it. This greatestpossible rate of transmissionis calledthe channel capacity for a noisy channel. The channel capacity is measuredin bits per symbol or per second. 
So far, the channel capacity is merely a mathematicallydefined quantity which we can compute if we know the probabilitiesof various sortsof errorsin the transmissionof symbols.The channel capacrtyis important, becauseShannon proves, &s his fundamental theorem for the noisy channel, that when the entropy or informa- tion rate of a messagesource is lessthan this channelcapacity, the messagesproduced by the sourcecan be so encodedthat they can be transmitted over the noisy channel with an error lessthan any specifiedamount. In order to encodemessages for error-free transmissionover noisy channels,long sequencesof symbolsmust be lumped together and encodedas one supersymbol.This is the sort of block encoding that we have encounteredearlier. Here we are using it for a new purpose. We are not using it to remove the redundancy of the messagesproduced by u messagesource. Instead, we are usingit to add redundancyto nonredundant messagesso that they can be transmitted without error over a noisy channel. Indeed, the whole problem of efficient and error-free communication turns out to be ittat of removing from messagesthe somewhatinefficient redun- dancy which they have and then adding redundancy of the right The Noisy Channel 165 sort in order to allow correction of errors made in transmission. The redundant digits we must use in encodirg messagesfor error-free transmission,of courseoslow the speedof transmission. We have seenthat in using a binary symmetric channel in which I transmitteddigit in 100is erroneouslyreceived, we can sendonly 92 correctnonredundant message digits for each 100digits we feed into the noisy channel.This meansthat on the average,we must usea redundantcode in which, for each92 nonredundantmessage digits, we must include in some way 8 extra check digits thus making the over-allstream of digits redundant. Shannon'svery generalwork tellsus in principle how to proceed. But, the mathematicaldifficulties of treatingcomplicated channels aregreat. Even in the caseof the simple, symmetric,ofron binary channel, the problem of finding efficient codes is formidable, althoughmathematicians have found a largenumber of bestcodes. Alas, eventhese seem to be too complicatedto use! Is this a discouragingpicture? How much wiser we are than in the days beforeinformation theory! We know what the problem is. We know in principlehow well we can do, and the resulthas astonishedengineers and mathematicians.Further, we do have effectiveerror-correcting codes that are used in a variety of appli- cations, including the transmissionback to earth of glamorous picturesof far planets. CHAPTERIX Many Dimensions

Yr^q.nsAND vEARS AGo (over thirty) I found in the public library of St. Paul a little book which introduced me to the mysteriesof the fourth dimension.It was Flatland, by Abbott. It describesa two-dimensional world without thickness.Such a world and all its people could be drawn in complete detail, inside and out, on a sheetof paper. What I now most rememberand admire about the book arethe descriptions of Flatland society. The inhabitants are polygonal, and sidednessdetermines social status.The most exaltedof the multisided creatureshold the honorary statusof circles.The lowest order is isoscelestriangles. Equilateral triangles are a stephigher, for regularity is admired and required. Indeed, irregular children are cracked and reset to attain regulanty, an operation which is frequently fatal. Women ate extremely narrow, needle-likecrea- tures and are greatly admired for their swayingmotion. The author of record, A. Square, accords well with all we have come to associate with the word. Flatland has a mathematical moral as well. The protagonistis astonishedwhen a circle of varying size suddenly appearsin his world. The circle is, of course,the intersectionof a three-dimen- sional creature,a sphere,with the plane of Flatland. The sphere explains the mysteriesof three dimensionsto A. Square,who in turn preaches the strange doctrine. The reader is left with the thought that he himself may somedayencounter a fluctuating and disappearingentity, the three-dimensionalintersection of a four- dimensional creaturewith our world.

166 Many Dimensions r67 Four-dimensionalcubes or /esseracts, hyperspheres, and other hypergeometricforms are old stuff both to mathematiciansand to sciencefiction writers. Supposinga fourth dimension like unto the three which we know, we can imagine many three-dimensional worlds existing as closeto one another as the pagesof a manu- script, each imprinted with different and distinct charactersand each separatefrom every other. We can imagine travelirg through the fourth dimension from one world to another or reaching through the fourth dimensioninto a safeto stealthe bonds or into the abdomento snatchan appendix. Most of us have heard also that Einstein used time as a fourth dimension,and somemay have heard of the many-dimensional phasespaces of physics,in which the three coordinatesand three velocity componentsof each of many particles are all regarded as dimensions. Clearly, this sort of thing is different from the classicalidea of a fourth spatial dimensionwhich is just like the three dimensions of up and down, back and forth, and left and right, those we all know so well. The truth of the matter is that nineteenth-century mathematicianssucceeded in generuhzinggeometry to include any number of dimensionsor even an infinity of dimensions. Thesedimensions are for the pure mathematicianmerely mental constructs.He startsout with a line called the x direction or x Axis, as shownin a of Figure IX- 1. Somepoint p lies a distancexo to the right of the origin O on the x axis. This coordinatexp Ln fact describesthe location of the point p. The mathematiciancan then add a y axisperpendicular to the x axis, &sshown in b of Figure IX- l. He can specifythe location of a point p rn the two-dimensionalspace or plane in which these axeslie by meansof two numbersor coordinates:the distancefrom the origin O in the y directioil, that is, the herght yr, and the distancefrom the origin O in the x dire ction xp, that is, how far p is to the right of the origin O. In c of Figure IX- I the x, /, and z axesare supposedto be all perpendicularto one another,like the edgesof a cube.These axes representthe directionsof the three-dimensionalspace with which we are familiar. The location of the point / is given by its height y, abovethe origin Q its distancexo to the right of the origin O, and its distancezo behind the origin O. 168 Symbols, Signalsand Noise U -r-

[Figure IX-1: (a) a point p on the x axis a distance xp from the origin O; (b) a point p located in the x, y plane by the coordinates xp and yp; (c) a perspective sketch of a point p located by x, y, and z coordinates in three dimensions; (d) a two-dimensional perspective sketch of five mutually perpendicular axes x1 through x5.]

axes Of course,in the drawing c of Figure IX-l the x, /, and z merely aren,t really all perpendiculir to onJanother. we have here a two-dimensiottal perspective sketch of an actual three-dimen- to one sional situation in which the axes are all perpendicular another. In d of Figure IX-l, we similarly have a two-dimensional we perspectivesketch of axesin a fi,ve-dimensionalspace. Since we have come to the end of the alphabet in going from x to z, to the merely labeled thesedirections xt, x2, xB,x4, x5, Llcording practice of mathematicians. Of coursethese five axesof d of Figure IX-l arenot all peqpen- Many Dimensions r69 dicular to one another in the drawing, but neither are the three axesof c. We can't luy out five mutually perpendicularlines in our three-dimensionalspace, but the mathematiciancan deal logically with a "space" in which five or more axes are mutually perpen- dicular. He can reasonout the propertiesof various geometrical figuresin a five-dimensionalspace, in which the position of a point p is describedby five coordinatesxLp, xzp, xup, x4p, xsp.To make the spacelike ordinary space(a Euclideanspace) the mathematician saysthat the squareof the distanced of the porntp from the origin shall be given by

d² = x1p² + x2p² + x3p² + x4p² + x5p²    (9.1)

In dealing with multidimensional spaces, mathematicians define the "volume" of a "cubical" figure as the product of the lengths of its sides. Thus, in a two-dimensional space the figure is a square, and, if the length of each side is L, the "volume" is the area of the square, which is L². In three-dimensional space the volume of a cube of width, height, and thickness L is L³. In five-dimensional space the volume of a hypercube of extent L in each direction is L⁵, and a ninety-nine-dimensional cube L on a side would have a volume L⁹⁹.

Some of the properties of figures in multidimensional space are simple to understand and startling to consider. For instance, consider a circle of radius 1 and a concentric circle of radius 1/2 inside of it, as shown in Figure IX-2. The area ("volume") of a circle is πr², so the area of the outer circle is π and the area of the inner circle is π(1/2)², or π/4.

[Figure IX-2: a circle of radius 1 with a concentric circle of radius 1/2 inside it.]

Thus, a quarter of the area of the whole circle lies within a circle of half the diameter.

Suppose, however, that we regard Figure IX-2 as representing spheres. The volume of a sphere is (4/3)πr³, and we find that 1/8 of the volume of a sphere lies within a sphere of half the diameter. In a similar way, the volume of a hypersphere of n dimensions is proportional to r^n, and as a consequence the fraction of the volume which lies in a hypersphere of half the radius is 1/2^n. For instance, for n = 7 this is a fraction 1/128.

We could go through a similar argument concerning the fraction of the volume of a hypersphere of radius r that lies within a sphere of radius 0.99r. For a 1,000-dimension hypersphere we find that a fraction 0.00004 of the volume lies in a sphere of 0.99 the radius. The conclusion is inescapable that in the case of a hypersphere of very high dimensionality, essentially all of the volume lies very near to the surface!

Are such ideas anything but pure mathematics of the most esoteric sort? They are pure and esoteric mathematics unless we attach them to some problem pertaining to the physical world. Imaginary numbers, such as the square root of -1, once had no practical physical meaning. However, imaginary numbers have been assigned meanings in electrical engineering and physics. Can we perhaps find a physical situation which can be represented accurately by the mathematical properties of hyperspace? We certainly can, right in the field of communication theory. Shannon has used the geometry of multidimensional space to prove an important theorem concerning the transmission of continuous, band-limited signals in the presence of noise.

Shannon's work provides a wonderful example of the use of a new point of view and of an existing but hitherto unexploited branch of mathematics (in this case, the geometry of multidimensional spaces) in solving a problem of great practical interest. Because it seems to me so excellent an example of applied mathematics, I propose to go through a good deal of Shannon's reasoning. I believe that the course of this reasoning is more unfamiliar than difficult, but the reader will have to embark on it at his own peril.

In order to discuss this problem of transmission of continuous signals in the presence of noise, we must have some common measure of the strength of the signal and of the noise. Power turns out to be an appropriate and useful measure.

When we exert a force of 1 lb over a distance of 1 ft in raising a 1 lb weight to the height of 1 ft we do work. The amount of work done is 1 foot-pound (ft-lb). The weight has, by virtue of its height, an energy of 1 ft-lb. In falling, the weight can do an amount of work (as in driving a clock) equal to this energy.

Power is rate of doing work. A machine which expends 33,000 ft-lb of energy and does 33,000 ft-lb of work in a minute has by definition a power of 1 horsepower (hp). In electrical calculations, we reckon energy and work in terms of a unit called the joule and power in terms of a unit called a watt. A watt is one joule per second.

If we double the voltage of a signal, we increase its energy and power by a factor of 4. Energy and power are proportional to the square of the voltage of a signal.
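The volume fractions quoted above are easy to check numerically. The fraction of an n-dimensional hypersphere's volume lying within a given fraction of its radius is simply that fraction raised to the nth power; the brief Python snippet below is my own check of the figures used in this chapter.

    # Fraction of an n-dimensional hypersphere's volume lying inside
    # the concentric sphere of radius (fraction * r): fraction ** n.
    def inner_volume_fraction(fraction, n):
        return fraction ** n

    print(inner_volume_fraction(0.5, 2))      # 0.25  -- the circle
    print(inner_volume_fraction(0.5, 3))      # 0.125 -- the sphere
    print(inner_volume_fraction(0.5, 7))      # about 1/128
    print(inner_volume_fraction(0.99, 1000))  # about 0.00004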
We have seen as far back as Chapter IV that a continuous signal of band width W can be represented completely by its amplitude at 2W sample points per second. Conversely, we can construct a band-limited signal which passes through any 2W sample points per second which we may choose. We can specify each sample arbitrarily and change it without changing any other sample. When we so change any sample we change the corresponding band-limited signal.

We can measure the amplitudes of the samples in volts. Each sample represents an energy proportional to the square of its voltage. Thus, we can express the squares of the amplitudes of the samples in terms of energy. By using rather special units to measure energy, we can let the energy be equal to the square of the sample amplitude, and this won't lead to any troubles.

Let us, then, designate the amplitudes of successive and correctly chosen samples of a band-limited signal, measured perhaps in volts, by the letters x1, x2, x3, etc. The parts of the signal energy represented by the samples will be x1², x2², x3², etc. The total energy of the signal, which we shall call E, will be the sum of these energies:

E = x1² + x2² + x3² + etc.    (9.2)

But we see that in geometrical terms E is just the square of the distance from the origin, as given by 9.1, if x1, x2, x3, etc., are the coordinates of a point in multidimensional space! Thus, if we let the amplitudes of the samples of a band-limited signal be the coordinates of a point in hyperspace, the point itself represents the complete signal, that is, all the samples taken together, and the square of the distance of the point from the origin represents the energy of the complete signal.

Why should we want to represent a signal in this geometrical fashion? The reason that Shannon did so was to prove an important theorem of communication theory concerning the effect of noise on signal transmission.

In order to see how this can be done, we should recall the mathematical model of a signal source which we adopted in Chapter III. We there assumed that the source is both stationary and ergodic. These assumptions must extend to the noise we consider and to the combined "source" of signal plus noise. It is not actually impossible that such a source might produce a signal or a noise consisting of a very long succession of very high-energy samples or of very low-energy samples, any more than it is impossible that an ergodic source of letters might produce an extremely long run of E's. It is merely very unlikely. Here we are dealing with the theorem we encountered first in Chapter V. An ergodic source can produce a class of messages which are probable and a class which are so very improbable that we can disregard them. In this case, the improbable messages are those for which the average power of the samples produced departs significantly from the time average (and the ensemble average) characteristic of the ergodic source.

Thus, for all the long messages that we need to consider, there is a meaningful average power of the signal which does not change appreciably with time. We can measure this average power by adding the energies of a large number of successive samples and dividing by the time T during which the samples are sent. As we make the time T longer and longer and the number of samples larger and larger, we will get a more and more accurate value for the average power. Because the source is stationary, this average power will be the same no matter what succession of samples we use. We can say this in a different way.
Except in cases so unlikely that we need not consider them, the total energy of a large number of successive samples produced by a stationary source will be nearly the same (to a small fractional difference) regardless of what particular succession of samples we choose.

Because the signal source is ergodic as well as stationary, we can say more. For each signal the source produces, regardless of what the particular signal is, it is practically certain that the energy of the same large number of successive samples will be nearly the same, and the fractional differences among energies get smaller and smaller as the number of samples is made larger and larger.

Let us represent the signals from such a source by points in hyperspace. A signal of band width W and duration T can be represented by 2WT samples, and the amplitude of each of these samples is the distance along one coordinate axis of hyperspace. If the average energy per sample is P, the total energy of the 2WT samples will be very close to 2WTP if 2WT is a very large number of samples. We have seen that this total energy tells how far from the origin the point which represents the signal is. Thus, as the number of samples is made larger and larger, the points representing different signals of the same duration produced by the source lie within a smaller and smaller distance (measured as a fraction of the radius) from the surface of a hypersphere of radius √(2WTP). The fact that the points representing the different signals all lie so close to the surface is not surprising if we remember that for a hypersphere of high dimensionality almost all of the volume is very close to the surface.

We receive, not the signal itself, but the signal with noise added. The noise which Shannon considers is called white Gaussian noise. The word white implies that the noise contains all frequencies equally, and we assume that the noise contains all frequencies equally up to a frequency of W cycles per second and no higher frequencies. The word Gaussian refers to a law for the probability of samples of various amplitudes, a law which holds for many natural sources of noise. For such Gaussian noise, each of the 2W samples per second which represent it is uncorrelated and independent. If we know the average energy of the samples, which we will call N, knowing the energy of some samples doesn't help to predict the energy of others. The total energy of 2WT samples will be very nearly 2WTN if 2WT is a large number of samples, and the energy will be almost the same for any succession of noise samples that are added to the signal samples.

We have seen that a particular succession of signal samples is represented by some point in hyperspace a distance √(2WTP) from the origin. The sum of a signal plus noise is represented by some point a little distance away from the point representing the signal alone. In fact, we see that the distance from the point representing the signal alone to the point representing the signal plus the noise is √(2WTN). Thus, the signal plus the noise lies in a little hypersphere of radius √(2WTN) centered on the point representing the signal alone.

Now, we don't receive the signal alone. We receive a signal of average energy P per sample plus Gaussian noise of average energy N per sample. In a time T the total received energy is 2WT(P + N), and the point representing whatever signal was sent plus whatever noise was added to it lies within a hypersphere of radius √(2WT(P + N)).

After we have received a signal plus noise for T seconds we can find the location of the point representing the signal plus noise. But how are we to find the signal? We only know that the signal lies within a distance √(2WTN) of the point representing the signal plus noise. How can we be sure of deducing what signal was sent?

Suppose that we put into the hypersphere of radius √(2WT(P + N)), in which points representing a signal plus noise must lie, a large number of little nonoverlapping hyperspheres of radius a bare shade larger than √(2WTN). Let us then send only signals represented by the center points of these little spheres.

When we receive the 2WT samples of any particular one of these signals plus any noise samples, the corresponding point in hyperspace can only lie within the particular little hypersphere surrounding that signal point and not within any other. This is so because, as we have noted, the points representing long sequences of samples produced by an ergodic noise source must be almost at the surface of a sphere of radius √(2WTN). Thus, the signal sent can be identified infallibly despite the presence of the noise.

How many such nonoverlapping hyperspheres of radius √(2WTN) can be placed in a hypersphere of radius √(2WT(P + N))? The number certainly cannot be larger than the ratio of the volume of the larger sphere to that of the smaller sphere. The number n of dimensions in the space is equal to the number of signal (and noise) samples, 2WT. The volume of a hypersphere in a space of n dimensions is proportional to r^n. Hence, the ratio of the volume of the large signal-plus-noise sphere to the volume of the little noise sphere is

(√(2WT(P + N)) / √(2WTN))^(2WT) = ((P + N)/N)^(WT)

This is a limit to the number of distinguishable messages we can transmit in a time T. The logarithm of this number is the number of bits which we can transmit in the time T.

It is

WT log ((P + N)/N)

As the message is T seconds long, the corresponding number of bits per second C is

C = W log (1 + P/N)    (9.3)

Having got to this point, we can note that the ratio of average energy per signal sample to average energy per noise sample must be equal to the ratio of average signal power to average noise power, and we can, in 9.3, regard P/N as the ratio of signal power to noise power instead of as the ratio of average signal-sample energy to average noise-sample energy.

The foregoing argument, which led to 9.3, has merely shown that no more than C bits per second can be sent with a band width of W cycles per second using a signal of power P mixed with a Gaussian noise of power N. However, by a further geometrical argument, in which he makes use of the fact that the volume of a hypersphere of high dimensionality is almost all very close to the surface, Shannon shows that the signaling rate can approach as close as one likes to C as given by 9.3 with as small a number of errors as one likes. Hence, C, as given by 9.3, is the channel capacity for a continuous channel in which a Gaussian noise is added to the signal.

It is perhaps of some interest to compare equation 9.3 with the expressions for speed of transmission and for information which Nyquist and Hartley proposed in 1928 and which we discussed in Chapter II. Nyquist's and Hartley's results both say that the number of binary digits which can be transmitted per second is

n log m

Here m is the number of different symbols, and n is the number of symbols which are transmitted per second.

One sort of symbol we can consider is a particular value of voltage, as, +3, +1, -1, or -3. Nyquist knew, as we do, that the number of independent samples or values of voltage which can be transmitted per second is 2W. By using this fact, we can rewrite equation 9.3 in the form

C = (n/2) log (1 + P/N)
C = n log √(1 + P/N)

Here we are really merely retracing the steps which led us to 9.3. We see that in equation 9.3 we have got at the average number m of different symbols we can send per sample, in terms of the ratio of signal power to noise power. If the signal power becomes very small or the noise power becomes very large, so that P/N is nearly 0, then the average number of different symbols we can transmit per sample goes to

√(1 + 0) = 1

Thus, the average number of symbols we can transmit per sample and the channel capacity go to 0 as the ratio of signal power to noise power goes to 0. Of course, the number of symbols we can transmit per sample and the channel capacity become large as we make the ratio of signal power to noise power large.

Our understanding of how to send a large average number of
In Figure VIII- I of Chapter VIII, we consideredsending binary digits one at a time in the presenceof noiseby using a signalwhich was either a positiveor a negativepulse of a pafticular amplitude and calling the receivedsignal a I if the signal plus noise was positive and a 0 if the received signal plus noise was negative. Supposethat in this casewe make the signal powerful enough comparedwith the noise,which we assumeto be Gaussian,so that only I receiveddigit in 100,000will be in error. Calculationsshow that this calls for about six times the signal power which equation 9.3 sayswe will needfor the same band width and noise power. The extra power is neededbecause we use asa signaleither a short positive or negativepulse specifying one binary digit, rather than using one of many long signalsconsisting of many ditrerent samples of various amplitudesto representmany successivebinary digits. One very specialway of approaching the ideal signaling rate or channel capaiity for a small, uu.tug. rlgt ul powet ii a laige noise power is to concentratethe signal po-*rt ir, a single sliort but powerful pulseand to sendthis pulse in one of many possibletime positions,each of which representsa differentsymbol. In this very specialand unusual casewe can efficiently transmit symbolsone at a time. If we wish to approachShannon's limit for a chosenbandwidth we mustuse as elementsof the codelong, complicated signal waves that resemblegaussian noise. We can if we wish look on relation 9.3 not narrowly as telling us how many bits per secondwe can send over a particular com- munication channelbut, rather, &stelling us somethingabout the 178 Symbols, Signals and Noise possibilitiesof transmitting a signalof a specifiedband width with some required signal-to-noiseratio over a communicationchannel of some other band width and signal-to-noiseratio. For instance, supposewe must senda signalwith a band width of 4 megacycles per secondand attain a ratio of signal power to noise powerP / N of 1,000.Relation 9.3 tells us that the correspondingchannel capacity is C - 40,000,000bits/second But the samechannel capacity can be attained with the combina- tions shown in Table XIII. Tesrn XIII Combinationsof W and P/N Which Give Same ChannelCapacitY

W (cycles per second)        P/N
4,000,000                    1,000
8,000,000                    30.6
2,000,000                    1,000,000
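These combinations can be checked directly from equation 9.3. The short Python computation below is my own check; the logarithm is taken to the base 2 so that the result comes out in bits per second.

    from math import log2

    def channel_capacity(W, snr):
        # Equation 9.3: C = W log(1 + P/N), in bits per second
        # when the logarithm is taken to the base 2.
        return W * log2(1 + snr)

    for W, snr in [(4_000_000, 1_000), (8_000_000, 30.6), (2_000_000, 1_000_000)]:
        print(W, snr, round(channel_capacity(W, snr) / 1e6, 1), "megabits per second")
    # each combination comes out at very nearly 40 megabits per second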

We see from Table XIII that, in attainitg a given channel capacity, we can use a broader band width and a lower ratio of signal to noise or a narrower band width and a higher ratio of signal to noise. Early workers in the field of information theory were intrigued with the idea of cutting down the band width required by increas- irg the power used.This calls for lots of power. Experiencehas shown that it is much more useful and practical to increasethe band width so as to get a good signal-to-noiseratio with lesspower than would otherwisebe required. This is just what is done in FM transmission,as an example.In FM transmission,a particular amplitude of the messagesignal to be transmitted,which may, for instance,be music, is encodedas a radio signal of a particular frequency.As the amplitudeof the messagesignal rises and falls, the frequency of the FM signal which t"pt.sents it changesgreatly, So that in sendinghigh fidelity music *ttic6 has a band width of 15,000cps, the FM radio signal can range over a band width of 150,000cps. BecauseFM trans- Many Dimensions t79

mission makesuse of a band width much larger than that of the music of which it is an encoding,the signal-to-noiseratio of the receivedmusic can be much higher than the ratio of signalpower to noise power in the FM signal that the radio receiverreceives. FM is not, however,an ideally efficientsystem; it doesnot work the improvementwhich we might expectfrom 9.3. Ingeniousinventors are everdevising improved systemsof mod- ulation. TWicein my experiencesomeone has proposedto me a systemwhich purportedto do betterthan equation 9.3,for the ideal channel capacrty,allows. The suggestionswere plausible, but I knew, just as in the case of perpetual motion machines, that somethinghad to be wrong with them. Careful analysisshowed where the error lay. Thus, communicationtheory can be valuable in tellitg us what can't be accomplishedas weli as in suggesting what can be. One thing that can't be accomplishedin improving the signal- to-noise ratio by increasingthe band width is to make a syite* which will behavein an orderly and happy way for all ratjos of signal power to noisepower. According to the view put forward in this chapter,we look on a signalas a point in a multidimensionalspace, where the number of dimensionsis equalto the number of samples.To senda narrow- band signal of a few samplesby means of a broad-band signal havingmore samples,we must in someway map points in a rpure of few dimensionsinto points in a spaceof moie-dimensionsin a one-to-onefashion. Wuy back in ChapterI, we proved a theorem concerning-ontothe mapping of points of a spaceof two dimensions(a plane) points of a spaceof one dimension(a line). We proued that if we map each point of the plane in a one-to-onefashion into a single correspondingpoint on the line, the mapping cannot be continu- ous. That is, if we move smoothly along a path in the plane from point to nearbypoint, the correspondingpositions on the line must ju*p back and forth discontinuously.A similar theorem is true for the mapping of the points of any spaceonto a spaceof differ- ent dimeniibttality. Thi bodestroubli fot transmissionschenies in which few messagesamples are representedby many signal samples. 180 Symbols, Signals and Noise Shannon gives a simple example of this sort of trouble, which is illustrated in Figure IX-3. S.rpposethat we use two sample amplitudes v2 and v1 to representa single sampleamplitude il. We rcgard v2 and v1 &s the distance up from and to the right of the lower left hand corner of a square.In the square,we draw a snaky line which starts near the lower left-hand corner and goesback and forth acrossthe square, gradually progressingupward. We let distance along this line, measuredfrom its origin near the lower left-hand corner to some specifiedpoint along the line, be u, the voltage or amplitude of the signal to be transmitted. Ceitainly, ury value of u is representedby particular valuesof 'We v1 and v2. seethat the range of v1 ot v2 is lessthan the range of u. We can transmit u1 and v2 and then reconstructu with geat accuracy.Or can we? Supposea little noise gets into vr and v2,so that, when we try to 611athe corresponding value of u at the receiver, we land somewherein a circle of uncertainty due to noise.As long as the diameter of the circle is less than the distancebetween the loops of the snaky path, we can tell what the correct value of u ts to a fractional error much smaller than the fractional error of v1or v2.

[Figure IX-3: a snaky path filling a square in the v1, v2 plane, with a circle of uncertainty due to noise about a received point.]

But if the noise is larger, we can't be sure which loop of the snaky path was intended, and we frequently make a larger error in u. This sort of behavior is inevitable in systems, such as FM, which use a large band width in order to get a better signal-to-noise ratio. As the noise added in transmission is increased, the noise in the received (demodulated) signal at first increases gradually and then increases catastrophically. The system is said to "break" at this level of signal to noise. Here we have an instance in which a seemingly abstract theorem of mathematics tells us that a certain type of behavior cannot be avoided in electrical communication systems of a certain general type.
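The breaking behavior can be imitated with a crude, discretized version of Shannon's construction, in which a fixed number of horizontal strips stands in for the loops of the snaky path. The strip count and the noise levels below are arbitrary assumptions, chosen only to show the effect.

```python
import random

LOOPS = 10  # number of strips standing in for the loops of the snaky path

def encode(u):
    """Map u in [0, 1) to two coordinates (v1, v2) along a snaky path."""
    loop = int(u * LOOPS)            # which horizontal strip we are in
    pos = u * LOOPS - loop           # how far along that strip
    if loop % 2 == 1:                # alternate direction on alternate loops
        pos = 1.0 - pos
    return pos, (loop + 0.5) / LOOPS

def decode(v1, v2):
    """Recover u from noisy coordinates by snapping to the nearest strip."""
    loop = min(max(int(v2 * LOOPS), 0), LOOPS - 1)
    pos = min(max(v1, 0.0), 1.0)
    if loop % 2 == 1:
        pos = 1.0 - pos
    return (loop + pos) / LOOPS

def mean_error(noise):
    errs = []
    for _ in range(20_000):
        u = random.random()
        v1, v2 = encode(u)
        u_hat = decode(v1 + random.gauss(0, noise), v2 + random.gauss(0, noise))
        errs.append(abs(u_hat - u))
    return sum(errs) / len(errs)

for noise in (0.005, 0.02, 0.05, 0.2):
    print(f"noise {noise:5.3f}: mean error in u = {mean_error(noise):.4f}")
# While the noise stays well below the strip spacing (0.1), the error in u is
# roughly the noise divided by the number of loops; once the noise approaches
# the spacing, whole loops are confused and the error jumps up: the system "breaks."
```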
The approach in this chapter has been essentially geometrical. This is only one way of dealing with the problems of continuous signals. Indeed, Shannon gives another in his book on communication theory, an approach which is applicable to all types of signals and noise. The geometrical approach is interesting, however, because it is proving illuminating and fruitful in many problems concerning electric signals which are not directly related to communication theory.

Here we have arrived at a geometry of band-limited signals by sampling the signals and then letting the amplitudes of the samples be the coordinates of a point in a multidimensional space. It is possible, however, to geometrize band-limited signals without speaking in terms of samples, and mathematicians interested in problems of signal transmission have done this. In fact, it is becoming increasingly common to represent band-limited signals as points in a multidimensional "signal space" or "function space" and to prove theorems about signals by the methods of geometry.

The idea of signals as points in a multidimensional signal space or function space is important, because it enables mathematicians to think about and to make statements which are true about all band-limited signals, or about large classes of band-limited signals, without considering the confusing details of particular signals, just as mathematicians can make statements about all triangles or all right triangles. Signal space is a powerful tool in the hands or, rather, in the minds of competent mathematicians. We can only wonder and admire.

From the point of view of communication theory, our chief concern in this chapter has been to prove an important theorem concerning a noisy continuous channel. The result is embodied in equation 9.3, which gives the rate at which we can transmit binary digits with negligible error over a continuous channel in which a signal of band width W and power P is mixed with a white Gaussian noise of band width W and power N.

Nyquist knew, in 1928, that one can send 2W independent symbols per second over a channel of band width W, but he didn't know how many different symbols could be sent per second for a given ratio of signal power to noise power. We have found this out for the case of a particular, common type of noise. We also know that even if we can transmit some average number m of symbols per sample, in general, we can't do this by trying to encode successive symbols independently as particular voltages. Instead, we must use block encoding, and encode a large number of successive symbols together.

Equation 9.3 shows that we can use a signal of large band width and low ratio of signal power to noise power in transmitting a message which has a small band width and a large ratio of signal power to noise power. FM is an example of this. Such considerations will be pursued further in Chapter X.

This chapter has had another aspect. In it we have illustrated the use of a novel viewpoint and the application of a powerful field of mathematics in attacking a problem of communication theory. Equation 9.3 was arrived at by the by-no-means-obvious expedient of representing long electrical signals and the noises added to them by points in a multidimensional space. The square of the distance of a point from the origin was interpreted as the energy of the signal represented by the point.

Thus, a problem in communication theory was made to correspond to a problem in geometry, and the desired result was arrived at by geometrical arguments. We noted that the geometrical representation of signals has become a powerful mathematical tool in studying the transmission and properties of signals.

The geometrization of signal problems is of interest in itself, but it is also of interest as an example of the value of seeking new mathematical tools in attacking the problems raised by our increasingly complex technology. It is only by applying this order of thought that we can hope to deal with the increasingly difficult problems of engineering.

CHAPTER X

Information Theory and Physics

I HAVE GIVEN SOMETHING of the historical background of communication theory in Chapter II. From this we can see that communication theory is an outgrowth of electrical communication, and we know that the behavior of electric currents and electric and magnetic fields is a part of physics.

To Morse and to other early telegraphists, electricity provided a very limited means of communication compared with the human voice or the pen in hand. These men had to devise codes by means of which the letters of the alphabet could be represented by turning an electric current successively on and off. This same problem of the representation of material to be communicated by various sorts of electrical signals has led to the very general ideas concerning encoding which are so important in communication theory. In this relation of encoding to particular physical phenomena, we see one link between communication theory and physics.

We have also noted that when we transmit signals by means of wire or radio, we receive them inevitably admixed with a certain amount of interfering disturbances which we call noise. To some degree, we can avoid such noise. The noise which is generated in our receiving apparatus we can reduce by careful design and by ingenious invention. In receiving radio signals, we can use an antenna which receives signals most effectively from the direction of the transmitter and which is less sensitive to signals coming from other directions.

Further, we can make sure that our receiver responds only to the frequencies we mean to use and rejects interfering signals and noise of other frequencies.

Still, when all this is done, some noise will inevitably remain, mixed with the signals that we receive. Some of this noise may come from the ignition systems of automobiles. Far away from man-made sources, some may come from lightning flashes. But even if lightning were abolished, some noise would persist, as surely as there is heat in the universe.

Many years ago an English biologist named Brown saw small pollen particles, suspended in a liquid, dance about erratically in the field of his microscope. The particles moved sometimes this way and sometimes that, sometimes swiftly and sometimes slowly. This we call Brownian motion. Brownian motion is caused by the impact on the particles of surrounding molecules, which themselves execute an even wilder dance. One of Einstein's first major works was a mathematical analysis of Brownian motion.

The pollen grains which Brown observed would have remained at rest had the molecules about them been at rest, but molecules are always in random motion. It is this motion which constitutes heat. In a gas, a molecule moves in a disorganized way. It moves swiftly or slowly in straight lines between frequent collisions. In a liquid, the molecules jostle about in close proximity to one another but continually changing place, sometimes moving swiftly and sometimes slowly. In a solid, the molecules vibrate about their mean positions, sometimes with a large amplitude and sometimes with a small amplitude, but never moving much with respect to their nearest neighbors. Always, however, in gas, liquid, or solid, the molecules move, with an average energy due to heat which is proportional to the temperature above absolute zero, however erratically the speed and energy may vary from time to time and from molecule to molecule.

Energy of mechanical motion is not the only energy in our universe. The electromagnetic waves of radio and light also have energy. Electromagnetic waves are generated by changing currents of electricity. Atoms are positively charged nuclei surrounded by negative electrons, and molecules are made up of atoms. When the molecules of a substance vibrate with the energy of heat, relative motions of the charges in them can generate electromagnetic waves, and these waves have frequencies which include those of what we call radio, heat, and light waves. A hot body is said to radiate electromagnetic waves, and the electromagnetic waves that it emits are called radiation.

The rate at which a body which is held at a given temperature radiates radio, heat, and light waves is not the same for all substances. Dark substances emit more radiation than shiny substances. Thus, silver, which is called shiny because it reflects most of any waves of radio, heat, or light falling on it, is a poor radiator, while the carbon particles of black ink constitute a good radiator.
When radiation falls on a substance, the fraction that is reflected rather than absorbed is different for radiation of different frequencies, such as radio waves and light waves. There is a very general rule, however, that for radiation of a given frequency, the amount of radiation a substance emits at a given temperature is directly proportional to the fraction of any radiation falling on it which is absorbed rather than reflected. It is as if there were a skin around each substance which allowed a certain fraction of any radiation falling on it to pass through and reflected the rest, and as if the fraction that passed through the skin were the same for radiation either entering or leaving the substance.

If this were not so, we might expect a curious and unnatural (as we know the laws of nature) phenomenon. Let us imagine a completely closed box or furnace held at a constant temperature. Let us imagine that we suspend two bodies inside the furnace. Suppose (contrary to fact) that the first of these bodies reflected radiation well, absorbing little, and that it also emitted radiation strongly, while the second absorbed radiation well, reflecting little, but emitted radiation poorly. Suppose that both bodies started out at the same temperature. The first would absorb less radiation and emit more radiation than the second, while the second would absorb more radiation and emit less radiation than the first. If this were so, the second body would become hotter than the first.

This is not the case, however; all bodies in a closed box or furnace whose walls are held at a constant, uniform temperature attain just exactly the same temperature as the walls of the furnace, whether the bodies are shiny, reflecting much radiation and absorbing little, or whether they are dark, reflecting little radiation and absorbing much. This can be so only if the ability to absorb rather than reflect radiation and the ability to emit radiation go hand in hand, as they always do in nature.

Not only do all bodies inside such a closed furnace attain the same temperature as the furnace; there is also a characteristic intensity of radiation in such an enclosure. Imagine a part of the radiation inside the enclosure to strike one of the walls. Some will be reflected back to remain radiation in the enclosure. Some will be absorbed by the walls. In turn, the walls will emit some radiation, which will be added to that reflected away from the walls. Thus, there is a continual interchange of radiation between the interior of the enclosure and the walls. If the radiation in the interior were very weak, the walls would emit more radiation than the radiation which struck and was absorbed by them. If the radiation in the interior were very strong, the walls would receive and absorb more radiation than they emitted. When the electromagnetic radiation lost to the walls is just equal to that supplied by the walls, the radiation is said to be in equilibrium with its material surroundings. It has an energy which increases with temperature, just as the energy of motion of the molecules of a gas, a liquid, or a solid increases with temperature.

The intensity of radiation in an enclosure does not depend on how absorbing or reflecting the walls of the enclosure are; it depends only on the temperature of the walls.
If this were not so and we made a little hole joining the interior of a shiny, reflecting enclosure with the interior of a dull, absorbing enclosure at the same temperature, there would have to be a net flow of radiation through the hole from one enclosure to another at the same temperature. This never happens.

We thus see that there is a particular intensity of electromagnetic radiation, such as light, heat, and radio waves, which is characteristic of a particular temperature. Now, while electromagnetic waves travel through vacuum, air, or insulating substances such as glass, they can be guided by wires. Indeed, we can think of the signal sent along a pair of telephone wires either in terms of the voltage between the wires and the current of electrons which flows in the wires, or in terms of a wave made up of electric and magnetic fields between and around the wires, a wave which moves along with the current. As we can identify electrical signals on wires with electromagnetic waves, and as hot bodies radiate electromagnetic waves, we should expect heat to generate some sort of electrical signals.

J. B. Johnson, who discovered the electrical fluctuations caused by heat, described them, not in terms of electromagnetic waves but in terms of a fluctuating voltage produced across a resistor. Once Johnson had found and measured these fluctuations, another physicist was able to find a correct theoretical expression for their magnitude by applying the principles of statistical mechanics. This second physicist was none other than H. Nyquist, who, as we saw in Chapter II, also contributed substantially to the early foundations of information theory. Nyquist's expression for what is now called either Johnson noise or thermal noise is

    V² = 4kTRW    (10.1)

Here V² is the mean square noise voltage, that is, the average value of the square of the noise voltage, across the resistor. k is Boltzmann's constant:

    k = 1.37 × 10⁻²³ joule/degree

T is the temperature of the resistor in degrees Kelvin, which is the number of Celsius or centigrade degrees (which are 9/5 as large as Fahrenheit degrees) above absolute zero. Absolute zero is −273° centigrade or −459° Fahrenheit. R is the resistance of the resistor measured in ohms. W is the band width of the noise in cycles per second.

Obviously, the band width W depends only on the properties of our measuring device. If we amplify the noise with a broad-band amplifier we get more noise than if we use a narrow-band amplifier of the same gain. Hence, we would expect more noise in a television receiver, which amplifies signals over a band width of several million cycles per second, than in a radio receiver, which amplifies signals having a band width of several thousand cycles per second.

We have seen that a hot resistor produces a noise voltage. If we connect another resistor to the hot resistor, electric power will flow to this second resistor. If the second resistor is cold, the power will heat it. Thus, a hot resistor is a potential source of noise power. What is the most noise power N that it can supply? The power is

    N = kTW    (10.2)

In some ways, 10.2 is more satisfactory than 10.1. For one thing, it has fewer terms; the resistance R no longer appears. For another thing, its form is suitable for application to somewhat different situations.
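Equations 10.1 and 10.2 are easy to evaluate. In the sketch below the resistance, temperature, and band width are arbitrary assumptions, chosen merely to give a feeling for the size of Johnson noise.

```python
import math

k = 1.37e-23  # Boltzmann's constant, joule/degree (value quoted in the text)

def johnson_noise(T_kelvin, R_ohms, W_hz):
    """Mean-square noise voltage (eq. 10.1) and available noise power (eq. 10.2)."""
    v_squared = 4 * k * T_kelvin * R_ohms * W_hz
    power = k * T_kelvin * W_hz
    return v_squared, power

# A 1,000-ohm resistor at room temperature, measured over a 1-megacycle band width.
v2, N = johnson_noise(T_kelvin=293, R_ohms=1_000, W_hz=1e6)
print(f"rms noise voltage : {math.sqrt(v2) * 1e6:.2f} microvolts")
print(f"noise power       : {N:.3e} watts")
# Roughly 4 microvolts rms and about 4 x 10^-15 watts: tiny, but enough to
# matter when the signals we receive are themselves very weak.
```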
For instance, suppose that we have a radio telescope, a big parabolic reflector which focuses radio waves into a sensitive radio receiver. I have indicated such a radio telescope in Figure X-1. Suppose we point the radio telescope at different celestial or terrestrial objects, so as to receive the electromagnetic noise which they radiate because of their temperature. We find that the radio noise power received is given by 10.2, where T is the temperature of the object at which the radio telescope points.

If we point the telescope down at water or at smooth ground, what it actually sees is the reflection of the sky, but if we point it at things which don't reflect radio waves well, such as leafy trees or bushes, we get a noise corresponding to a temperature around 290° Kelvin (about 62° Fahrenheit), the temperature of the trees.

[Figure X-1: a radio telescope that may be pointed at the sun, at the sky, or at nearby trees.]

If we point the radio telescope at the moon and if the telescope is directive enough to see just the moon and not the sky around it, we get about the same noise, which corresponds not to the temperature of the very surface of the moon but to the temperature a fraction of an inch down, for the substance of the moon is somewhat transparent to radio waves.

If we point the telescope at the sun, the amount of noise we obtain depends on the frequency to which we tune the radio receiver. If we tune the receiver to a frequency around 10 million cycles per second (a wave length of 30 meters), we get noise corresponding to a temperature of around a million degrees Kelvin; this is the temperature of the tenuous outer corona of the sun. The corona is transparent to radio waves of shorter wave lengths, just as the air of the earth is. Thus, if we tune the radio receiver to a frequency of around 10 billion cycles per second, we receive radiation corresponding to a temperature of around 8,000° Kelvin, the temperature a little above the visible surface. Just why the corona is so much hotter than the visible surface which lies below it is not known.

The radio noise from the sky is also different at different frequencies. At frequencies above a few billion cycles per second the noise corresponds to a temperature of about 3.5° Kelvin. At lower frequencies the noise is greater and increases steadily as the frequency is lowered. The Milky Way, particular stars, and island universes or galaxies in collision all emit large amounts of radio noise. The heavens are telling the age of the universe through their 3.5-degree microwave radiation, but other signals come from other places at other frequencies.

Nonetheless, Johnson or thermal noise constitutes a minimum noise which we must accept, and additional noise sources only make the situation worse. The fundamental nature of Johnson noise has led to its being used as a standard in the measurement of the performance of radio receivers.

As we have noted, a radio receiver adds a certain noise to the signals it receives. It also amplifies any noise that it receives. We can ask, how much amplified Johnson noise would just equal the noise the receiver adds? We can specify this noise by means of an equivalent noise temperature Tn. This equivalent noise temperature Tn is a measure of the noisiness of the radio receiver. The smaller Tn is, the better the receiver is.

We can interpret the noise temperature Tn in the following way. If we had an ideal noiseless receiver with just the same gain and band width as the actual receiver and if we added Johnson noise corresponding to the temperature Tn to the signal it received, then the ratio of signal power to noise power would be the same for the ideal receiver with the Johnson noise added to the signal as for the actual receiver. Thus, the noise temperature Tn is a just measure of the noisiness of the receiver. Sometimes another measure based on Tn is used; this is called the noise figure NF. In terms of Tn, the noise figure is

    NF = (293 + Tn) / 293

    NF = 1 + Tn/293    (10.3)

The noise figure was defined for use here on earth, where every signal has mixed with it noise corresponding to a temperature of around 293° Kelvin. The noise figure is the ratio of the total output noise, including noise due to Johnson noise for a temperature of 293° Kelvin at the input and noise produced in the receiver, to the amplified Johnson noise alone.

Of course, the equivalent noise temperature Tn of a radio receiver depends on the nature and perfection of the radio receiver, and the lowest attainable noise figure depends on the frequency of operation. However, Table XIV below gives some rough figures for various sorts of receivers.

Table XIV

Type of Receiver                                  Equivalent Noise Temperature, Degrees Kelvin

Good radio or TV receiver                         1,500°

Maser receiving station for space missions        24°

Parametric amplifier receiver                     50°

The effective temperatures of radio receivers and the temperatures of the objects at which their antennas are directed are very important in connection with communication theory, because noise determines the power required to send messages. Johnson noise is Gaussian noise, to which equation 9.3 applies. Thus, ideally, in order to transmit C bits per second, we must have a signal power P related to the noise power N by a relation that was derived in the preceding chapter:

    C = W log (1 + P/N)    (9.3)

If we use expression 10.2 for noise, this becomes

    C = W log (1 + P/kTW)    (10.4)

Let us assume a given signal power P. If we make W very small, C will become very small. However, if we make W larger and larger, C does not become larger and larger without limit, but rather it approaches a limiting value. When P/kTW becomes very small compared with unity, 10.4 becomes

    C = 1.44 P/kT    (10.5)

We can also write this

    P = 0.693 kTC    (10.6)

Relation 10.6 says that, even when we use a very wide band width, we need at least a power of 0.693 kT joule per second to send one bit per second, so that on the average we must use an energy of 0.693 kT joule for each bit of information we transmit. We should remember, however, that equation 9.3 holds only for an ideal sort of encoding in which many characters representing many bits of information are encoded together into a long stretch of signal. Most practical communication systems require much more energy per bit, as we noted in Chapter IX.
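One can watch equation 10.4 approach the limit expressed by 10.5 and 10.6 simply by evaluating it for wider and wider band widths. In the sketch below the signal power and the temperature are arbitrary assumptions.

```python
import math

k = 1.37e-23   # Boltzmann's constant as quoted in the text, joule/degree
T = 293        # assumed temperature, degrees Kelvin
P = 1e-15      # assumed received signal power, watts

def capacity(P, T, W):
    """Equation 10.4: C = W log2(1 + P / kTW), in bits per second."""
    return W * math.log2(1 + P / (k * T * W))

for W in (1e3, 1e5, 1e7, 1e9, 1e11):
    print(f"W = {W:8.0e} cps  ->  C = {capacity(P, T, W):10.1f} bits/sec")

limit = 1.44 * P / (k * T)          # equation 10.5, the infinite-band limit
print(f"limiting capacity         {limit:10.1f} bits/sec")
print(f"energy per bit at limit   {0.693 * k * T:.3e} joule")   # equation 10.6
```

As the band width grows, the capacity stops growing and settles at the limit, and the energy spent per bit settles at 0.693 kT joule.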

But, haven't we forgotten something? What about quantum effects? These may not be important in radio, but they are certainly important in communication by light. And, light has found a wide range of uses. Tiny optical fibers carry voice and other communication traffic. Pulses of light reflected from "corner reflectors" left by astronauts on the moon make it possible to track the earth-moon distance with an accuracy of about a tenth of a meter.

Harry Nyquist was a very forward-looking man. The derivation he gave for an expression for Johnson noise applies to all frequencies, including those of light. His expression for Johnson noise power in a bandwidth W was

    N = hfW / (e^(hf/kT) − 1)    (10.7)

Here f is frequency in cycles per second and

    h = 6.63 × 10⁻³⁴ joule second

is Planck's constant. We commonly associate Planck's constant with the energy of a single photon of light, an energy E which is

    E = hf    (10.8)

Quantum effects become important when hf is comparable to or larger than kT. Thus, the frequency above which the exact expression for Johnson noise, 10.7, departs seriously from the expression valid at low frequencies, 10.2, will be about

    f = kT/h = 2.07 × 10¹⁰ T    (10.9)
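A brief numerical comparison of the exact expression 10.7 with the low-frequency expression 10.2 shows where the two part company. The temperature and the particular frequencies below are arbitrary assumptions.

```python
import math

k = 1.37e-23   # Boltzmann's constant, joule/degree (value quoted in the text)
h = 6.63e-34   # Planck's constant, joule second
T = 293        # assumed temperature, degrees Kelvin
W = 1e6        # assumed band width, cycles per second

def quantum_noise(f):
    """Nyquist's exact expression, equation 10.7."""
    return h * f * W / math.expm1(h * f / (k * T))

classical = k * T * W   # equation 10.2, independent of frequency

for f in (1e9, 1e11, 1e13, 6e14):   # from radio up to visible light
    ratio = quantum_noise(f) / classical
    print(f"f = {f:8.0e} cps: exact noise / classical noise = {ratio:.3e}")
# Near the crossover frequency kT/h (about 6 x 10^12 cps at this temperature)
# the exact expression begins to fall below kTW, and at the frequency of light
# it is utterly negligible.
```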

When we take quantum effects into account, we find less, not more, noise at higher frequencies, and very much less noise at the frequency of light. But, there are quantum limitations other than those imposed by Johnson noise. It turns out that, ideally, 0.693 kT joules per bit is still the limit, even at the frequency of light. Practically, it is impossible to approach this limit closely. And, there is a common but wrong way of communicating which is very much worse.

This is to amplify a weak received signal with the best sort of amplifier that is theoretically possible. Why is this bad?

At low frequencies, when we amplify a weak pulse we simply obtain a pulse of greater power, and we can measure both the time that this stronger pulse peaks and the frequency spectrum of the pulse. But, the quantum mechanical uncertainty principle says that we cannot measure both time and frequency with perfect accuracy. In fact, if Δt is the error in measurement of time and Δf is the error in measurement of frequency, the best we can possibly do is

    Δt Δf = 1, that is, Δt = 1/Δf    (10.10)

The implication of 10.10 is that if we define the frequency of a pulse very precisely by making Δf small, we cannot measure the time of arrival of the pulse very accurately. In more mundane terms, we can't measure the time of arrival of a long narrow-band pulse as accurately as we can measure the time of arrival of a short broad-band pulse. But, just how are we frustrated in trying to make such a measurement?

Suppose that we do amplify a weak pulse with the best possible amplifier, and also shift all frequencies of the pulse down to some low range of frequencies at which quantum effects are negligible. We find that, mixed with the amplified signal, there will be a noise power N

    N = GhfW    (10.11)

Here f is the frequency of the original high-frequency signal, G is the power gain of the amplifying and frequency shifting system, and W is the bandwidth. This noise is just enough to assure that we don't make measurements more accurately than allowed by the uncertainty principle as expressed in 10.10. In order to increase the accuracy of a time measurement, we must increase the bandwidth W of the pulse. But, the added noise due to increasing the bandwidth, as expressed by 10.11, just undoes any gain in accuracy of measurement of time.

From 10.11 we can make the same argument that we made on page 192, and conclude that because of quantum effects we must use an energy of at least 0.693 hf joules per bit in order to communicate by means of a signal of frequency f. This argument is valid only for communication systems in which we amplify the received signal with the best possible amplifier, an amplifier which adds to the signal only enough noise to keep us from violating the uncertainty principle.

Is there an alternative to amplifying the weak received signal? At optical frequencies there certainly is. Quanta of light can be used to produce tiny pulses of electricity. Devices called photodiodes and photomultipliers produce a short electrical output pulse when a quantum of light reaches them, although they randomly fail to respond to some quanta. The quantum efficiency of actual devices is less than 100%. Nonetheless, ideally we can detect the time of arrival of any quantum of light by using that quantum to produce a tiny electrical pulse.

Isn't this contrary to the uncertainty principle? No, because when we measure the time of arrival of a quantum in this way we can't tell anything at all about the frequency of the quantum.

Photodetectors, commonly called photon counters, are used to measure the time of arrival of pulses of light reflected from the corner reflectors that astronauts left on the moon. They are also used in lightwave communication via optical fibers. Of course they aren't used as effectively as is ideally possible. What is the ideal limit? It turns out to be the same old limit we found on page 192, that is, 0.693 kT joules per bit. Quantum effects don't change that limiting performance, but they do make it far more difficult to attain.

Why is this so? The energy per photon is hf. Ideally, the energy per bit is 0.693 kT. Thus, ideally the bits per photon must be

    hf/(0.693 kT), or 1.44 (hf/kT) bits per photon

How can we transmit many bits of information by measuring the time of arrival of just a few photons, or one photon? We can proceed in this way. At the transmitter, we send out a pulse of light in one of M sub-intervals of some long time interval T. At the receiving end, the particular time interval in which we receive a photon conveys the message.
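The scheme just described is a pulse-position code: the message is carried entirely by which of the M sub-intervals contains the pulse of light. The toy sketch below makes this concrete; the value of M and the chance of missing the photon are arbitrary assumptions.

```python
import math
import random

M = 64                      # number of sub-intervals in the long interval T
BITS = int(math.log2(M))    # bits conveyed per interval if a photon arrives

def encode(bits):
    """Choose the sub-interval whose index spells out the given bits."""
    assert len(bits) == BITS
    return int("".join(str(b) for b in bits), 2)

def decode(slot):
    """Read the bits back from the index of the sub-interval received."""
    return [int(c) for c in format(slot, f"0{BITS}b")]

message = [random.randint(0, 1) for _ in range(BITS)]
slot = encode(message)

# With some probability the detector misses the photon altogether,
# and the whole interval conveys nothing.
if random.random() < 0.1:       # assumed 10% chance of losing the photon
    print("no photon detected in this interval; the message is lost")
else:
    print("sent     :", message)
    print("received :", decode(slot))   # six bits delivered by a single photon
```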

At best, this will make it possible to convey log M bits of information for each interval T. But, sometimes we will receive no photons in any time interval. And, sometimes thermal photons, Johnson noise photons, will arrive in a wrong time interval. This is what limits the transmission to 1.44 (hf/kT) bits per photon. In reality, we can send far fewer bits per photon because it is impractical to implement effective systems that will send very many bits per photon.

We have seen through a combination of information theory and physics that it takes at least 0.693 kT joules of energy to convey one bit of information. In most current radio receivers the noise actually present is greater than the environmental noise because the amplifier adds noise corresponding to some temperature higher than that of the environment. Let us use as the noise temperature Tn the equivalent noise temperature corresponding to the noise actually added to the signal.

How does actual performance compare with the ideal expression given by equation 10.4? When we do not use error correction, but merely use enough signal power so that errors in receiving bits of information are very infrequent (around one error in 10⁸ bits received), at best we must use about 10 times as much power per bit as calculated from expression 10.4.

The most sophisticated communication systems are those used in sending data back from deep-space missions. These use extremely low-noise maser receivers, and they make use of sophisticated error correction, including convolutional coding and decoding using the Viterbi algorithm. In sending pictures of Jupiter and its satellites back to earth, the Voyager spacecraft could transmit 115,200 binary digits per second with an error rate of one in 200 by using only 21.3 watts of power. The power is only 4.4 db more than the ideal limit using infinite bandwidth.

Pluto is about 6 × 10¹² meters from earth. Ideally, how fast could we send data back from Pluto? We'll assume noise from space only, and no atmospheric absorption. If we use a transmitting antenna of effective area A_T and a receiving antenna of effective area A_R, the ratio of received power P_R to transmitted power P_T is given by Friis's transmission formula

    P_R / P_T = A_T A_R / (λ² L²)    (10.12)

where λ is the wavelength used in communication and L is the distance from transmitter to receiver, which in our case is 6 × 10¹² meters. Purely arbitrarily, we will assume a transmitter power of 10 watts.

We will consider two cases. In the first, we use a wavelength of 1 cm, or .01 meters. At this wavelength the temperature of space is indubitably 3.5° K, and we will assume that the transmitting antenna is 3.16 meters square, with an area of 10 square meters, and the receiving antenna is 31.6 meters square, with an area of 1,000 square meters. According to expression 10.12, if the transmitted power is 10 watts the received power will be 2.8 × 10⁻¹⁷ watts, or joules per second. If we take the energy per bit as 0.693 kT, T as 3.5° K, and use the value of Boltzmann's constant given on page 188, we conclude that our communication system can send us over 800,000 bits per second, a very useful amount of information.
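The microwave case can be checked by putting Friis's formula 10.12 together with the limit of 0.693 kT joule per bit. The sketch below merely repeats that arithmetic with the figures assumed above.

```python
k = 1.37e-23        # Boltzmann's constant, joule/degree (as quoted in the text)

def ideal_bit_rate(P_T, A_T, A_R, wavelength, L, T_space):
    """Received power from Friis's formula 10.12, divided by the ideal
    energy per bit, 0.693 kT (equation 10.6)."""
    P_R = P_T * A_T * A_R / (wavelength ** 2 * L ** 2)
    energy_per_bit = 0.693 * k * T_space
    return P_R, P_R / energy_per_bit

# Microwave link from Pluto: 10 watts, 1-cm wavelength, 10 and 1,000 square
# meter antennas, 6 x 10^12 meters, space at 3.5 degrees Kelvin.
P_R, rate = ideal_bit_rate(P_T=10, A_T=10, A_R=1_000,
                           wavelength=0.01, L=6e12, T_space=3.5)
print(f"received power : {P_R:.2e} watts")       # about 2.8 x 10^-17 watts
print(f"ideal bit rate : {rate:,.0f} bits/sec")  # over 800,000 bits per second
```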
What about an optical system? Let us assume a wavelength of 6 × 10⁻⁷ meters, corresponding to a frequency of 5 × 10¹⁴ cycles per second. This is visible light. We will assume somewhat smaller antennas (lenses or mirrors), a transmitting antenna one meter square, with an area of 1 square meter, and a receiving antenna 10 meters square, with an area of 100 square meters. Again, we will assume a transmitted power of 10 watts. The effective optical "temperature" of space, that is, the total starlight, is a little uncertain; we will take it as 350° K. We calculate a transmission capacity of 800 billion bits per second for our optical communication system.

If we receive 800 billion bits per second, we must receive 100 bits per photon. It seems unlikely that this is possible. But even if we received only one bit per photon we would receive 8 billion bits per second. Optical communication may be the best way to communicate over long distances in space.

From the point of view of information theory, the most interesting relation between physics and information theory lies in the evaluation of the unavoidable limitations imposed by the laws of physics on our ability to communicate. In a very fundamental sense, this is concerned with the limitations imposed by Johnson noise and quantum effects. It also, however, includes limitations imposed by atmospheric turbulence and by fluctuations in the ionosphere, which can distort a signal in a way quite different from adding noise to it. Many other examples of this sort of relation of physics to information theory could be unearthed.

Physicists have thought of a connection between physics and communication theory which has nothing to do with the fundamental problem that communication theory set out to solve, that is, the possibilities and the limitations of efficient encoding in transmitting information over a noisy channel. Physicists propose to use the idea of the transmission of information in order to show the impossibility of what is called a perpetual-motion machine of the second kind. As a matter of fact, this idea preceded the invention of communication theory in its present form, for L. Szilard put forward such ideas in 1929.

Some perpetual-motion machines purport to create energy; this violates the first law of thermodynamics, that is, the conservation of energy.

Other perpetual-motion machines purport to convert the disorganized energy of heat in matter or radiation which is all at the same temperature into ordered energy, such as the rotation of a flywheel. The rotating flywheel could, of course, be used to drive a refrigerator which would cool some objects and heat others. Thus, this sort of perpetual motion could, without the use of additional organized energy, transfer the energy of heat from cold material to hot material.

The second law of thermodynamics can be variously stated: that heat will not flow from a cold body to a hot body without the expenditure of organized energy, or that the entropy of a system never decreases. The second sort of perpetual-motion machine violates the second law of thermodynamics.

One of the most famous perpetual-motion machines of this second kind was invented by James Clerk Maxwell. It makes use of a fictional character called Maxwell's demon.

I have pictured Maxwell's demon in Figure X-2. He inhabits a divided box and operates a small door connecting the two chambers of the box. When he sees a fast molecule heading toward the door from the far side, he opens the door and lets it into his side. When he sees a slow molecule heading toward the door from his side he lets it through. He keeps slow molecules from entering his side and fast molecules from leaving his side. Soon, the gas in his side is made up of fast molecules. It is hot, while the gas on the other side is made up of slow molecules and it is cool. Maxwell's demon makes heat flow from the cool chamber to the hot chamber. I have shown him operating the door with one hand and thumbing his nose at the second law of thermodynamics with his other hand.

Maxwell's demon has been a real puzzler to those physicists who have not merely shrugged him off. The best general objection we can raise to him is that, since the demon's environment is at thermal equilibrium, the only light present is the random electromagnetic radiation corresponding to thermal noise, and this is so chaotic that the demon can't use it to see what sort of molecules are coming toward the door.

We can think of other versions of Maxwell's demon. What about putting a spring door between the two chambers, for instance? A molecule hitting such a door from one side can open it and go through; one hitting it from the other side can't open it at all. Won't we end up with all the molecules and their energy on the side into which the spring door opens?

[Figure X-2: Maxwell's demon at the door between the two chambers of the divided box.]

One objection which can be raised to the spring door is that, if the spring is strong, a molecule can't open the door, while, if the spring is weak, thermal energy will keep the door continually flapping, and it will be mostly open. Too, a molecule will give energy to the door in opening it. Physicists are pretty well agreed that such mechanical devices as spring doors or delicate ratchets can't be used to violate the second law of thermodynamics.

Arguing about what will and what won't work is a delicate business. An ingenious friend fooled me completely with his machine until I remembered that any enclosure at thermal equilibrium must contain random electromagnetic radiation as well as molecules. However, there is one simple machine which, although it is frictionless, ridiculous, and certainly inoperable in any practical sense, is, I believe, not physically impossible in the very special sense in which physicists use this expression. This machine is illustrated in Figure X-3.

The machine makes use of a cylinder C and a frictionless piston P. As the piston moves left or right, it raises one of the little pans p and lowers the other. The piston has a door in it which can be opened or closed. The cylinder contains just one molecule M. The whole device is at a temperature T. The molecule will continually gain and lose energy in its collisions with the walls, and it will have an average energy proportional to the temperature.

When the door in the piston is open, no work will be done if we move the piston slowly to the right or to the left. We start by centering the piston with the door open.

[Figure X-3: a cylinder C containing a single molecule M, with a frictionless piston P carrying a door D; pans p on either side carry little weights between lower shelves S1 and upper shelves S2.]

We clamp the piston in the center and close the door. We then observe which side of the piston the molecule is on. When we have found out which side of the piston the molecule is on, we put a little weight from the low shelf S1 onto the pan on the same side as the molecule and unclamp the piston. The repeated impact of the molecule on the piston will eventually raise the weight to the higher shelf S2, and we take the weight off and put it on this higher shelf. We then open the door in the piston, center it, and repeat the process. Eventually, we will have lifted an enormous number of little weights from the lower shelves S1 to the upper shelves S2. We have done organized work by means of disorganized thermal energy!

How much work have we done? It is easily shown that the average force F which the molecule exerts on the piston is

    F = kT/L    (10.10)

Here L is the distance from the piston to the end of the cylinder on the side containing the molecule. When we allow the molecule to push against the piston and slowly drive it to the end of the cylinder, so that the distance is doubled, the most work W that the molecule can do is

    W = 0.693 kT    (10.11)

Actually, in lifting a constant weight the work done will be less, but 10.11 represents the limit. Did we get this free? Not quite! When we have centered the piston and closed the door it is equally likely that we will find the molecule in either half of the cylinder. In order to know which pan to put the weight on, we need one bit of information, specifying which side the molecule is on. To make the machine run we must receive this information in a system which is at a temperature T. What is the very least energy needed to transmit one bit of information at the temperature T? We have already computed this; from equation 10.6 we see that it is exactly 0.693 kT joule, just equal to the most energy the machine can generate. In principle, this applies to the quantum case, if we do the best that is possible. Thus, we use up all the output of the machine in transmitting enough information to make the machine run!

It's useless to argue about the actual, the attainable, as opposed to the limiting efficiency of such a machine; the important thing is that even at the very best we could only break even.

We have now seen in one simple case that the transmission of information in the sense of communication theory can enable us to convert thermal energy into mechanical energy. The bit which measures amount of information used is the unit in terms of which the entropy of a message source is measured in communication theory. The entropy of thermodynamics determines what part of existing thermal energy can be turned into mechanical work. It seems natural to try to relate the entropy of thermodynamics and statistical mechanics with the entropy of communication theory.

The entropy of communication theory is a measure of the uncertainty as to what message, among many possible messages, a message source will actually produce on a given occasion. If the source chooses a message from among m equally probable messages, the entropy in bits per message is the logarithm to the base 2 of m; in this case it is clear that such messages can be transmitted by means of log m binary digits per message. More generally, the importance of the entropy of communication theory is that it measures directly the average number of binary digits required to transmit messages produced by a message source.

The entropy of statistical mechanics is the uncertainty as to what state a physical system is in. It is assumed in statistical mechanics that all states of a given total energy are equally probable. The entropy of statistical mechanics is Boltzmann's constant times the logarithm to the base e of the number of possible states. This entropy has a wide importance in statistical mechanics. One matter of importance is that the free energy, which we will call F.E., is given by

    F.E. = E − HT    (10.12)

Here E is the total energy, H is the entropy, and T is the temperature. The free energy is the part of the total energy which, ideally, can be turned into organized energy, such as the energy of a lifted weight.

In order to understand the entropy of statistical mechanics, we have to say what a physical system is, and we will do this by citing a few examples. A physical system can be a crystalline solid, a closed vessel containing water and water vapor, a container filled with gas, or any other substance or collection of substances. We will consider such a system when it is at equilibrium, that is, when it has settled down to a uniform temperature and when any physical or chemical changes that may tend to occur at this temperature have gone as far as they will go.
As a particular example of a physical system, we will consider an idealized gas made up of a lot of little, infinitely small particles, whizzing around every which way in a container.

The state of such a system is a complete description, or as complete a description as the laws of physics allow, of the positions and velocities of all of these particles. According to classical mechanics (Newton's laws of motion), each particle can have any velocity and energy, so there is an uncountably infinite number of states, as there is such an uncountable infinity of points in a line or a square. According to quantum mechanics, there is an infinite but countable number of states. Thus, the classical case is analogous to the difficult communication theory of continuous signals, while the more exact quantum case is analogous to the communication theory of discrete signals which are made up of a countable set of distinct, different symbols. We have dealt with the theory of discrete signals at length in this book.

According to quantum mechanics, a particle of an idealized gas can move with only certain energies. When it has one of these allowed energies, it is said to occupy a particular energy level. How large will the entropy of such a gas be? If we increase the volume of the gas, we increase the number of energy levels within a given energy range. This increases the number of states the system can be in at a given temperature, and hence it increases the entropy. Such an increase in entropy occurs if a partition confining a gas to a portion of a container is removed and the gas is allowed to expand suddenly into the whole container.

If the temperature of a gas of constant volume is increased, the particles can occupy energy levels of higher energy, so more combinations of energy levels can be occupied; this increases the number of states, and the entropy increases.

If a gas is allowed to expand against a slowly moving piston and no heat is added to the gas, the number of energy levels in a given energy range increases, but the temperature of the gas decreases just enough so as to keep the number of states and the entropy the same.

We see that for a given temperature, a gas confined to a small volume has less entropy than the same gas spread through a larger volume. In the case of the one-molecule gas of Figure X-3, the entropy is less when the door is closed and the molecule is confined to the space on one side of the piston. At least, the entropy is less if we know which side of the piston the molecule is on.

We can easily compute the decrease in entropy caused by halving the volume of an ideal, one-molecule, classical gas at a given temperature. In halving the volume we halve the number of states, and the entropy changes by an amount

    k logₑ (1/2) = −0.693 k

The corresponding change in free energy is the negative of T times this change in entropy, that is,

    0.693 kT

This is just the work that, according to 10.11, we can obtain by halving the volume of the one-molecule gas and then letting it expand against the piston until the volume is doubled again. Thus, computing the change in free energy is one way of obtaining 10.11.

In reviewing our experience with the one-molecule heat engine in this light, we see that we must transmit one bit of information in order to specify on which side of the piston the molecule is. We must transmit this information against a background of noise corresponding to the uniform temperature T. To do this takes 0.693 kT joule of energy.
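The bookkeeping of the last two paragraphs can be carried out numerically; in the sketch below the temperature is an arbitrary assumption, and the point is only that the two energies come out equal.

```python
import math

k = 1.37e-23      # Boltzmann's constant, joule/degree (value quoted in the text)
T = 293           # assumed temperature of the whole device, degrees Kelvin

# Halving the volume halves the number of states of the one-molecule gas.
entropy_change = k * math.log(0.5)            # k log_e(1/2) = -0.693 k
free_energy_gain = -T * entropy_change        # 0.693 kT, the most work obtainable

# Least energy to transmit one bit at temperature T (equation 10.6).
cost_of_one_bit = 0.693 * k * T

print(f"free energy gained per cycle : {free_energy_gain:.3e} joule")
print(f"cost of the one bit needed   : {cost_of_one_bit:.3e} joule")
# Both come to about 2.8 x 10^-21 joule at this temperature: at the very
# best the one-molecule engine only breaks even.
```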
Because we now know that the molecule is definitely on a particular side of the piston, the entropy is 0.693 k less than it would be if we were uncertain as to which side of the piston the molecule was on. This reduction of entropy corresponds to an increase in free energy of 0.693 kT joule. This free energy we can turn into work by allowing the piston to move slowly to the unoccupied end of the cylinder while the molecule pushes against it in repeated impacts. At this point the entropy has risen to its original value, and we have obtained from the system an amount of work which, alas, is just equal to the minimum possible energy required to transmit the information which told us on which side of the piston the molecule was.

Let us now consider a more complicated case. Suppose that a physical system has at a particular temperature a total of m states. Suppose that we divide these states into n equal groups. The number of states in each of these groups will be m/n.

Suppose that we regard the specification as to which one of the n groups of states contains the state that the system is in as a message source. As there are n equally likely groups of states, the communication-theory entropy of the source is log n bits. This means that it will take log n binary digits to specify the particular group of states which contains the state the system is actually in. To transmit this information at a temperature T requires at least

    0.693 kT log n = kT logₑ n

joule of energy. That is, the energy required to transmit the message is proportional to the communication-theory entropy of the message source.

If we know merely that the system is in one of the total of m states, the entropy is

    k logₑ m

If we are sure that the system is in one particular group of states containing only m/n states (as we are after transmission of the information as to which state the system is in), the entropy is

    k logₑ (m/n) = k (logₑ m − logₑ n)

The change in entropy brought about by information concerning which one of the n groups of states the system is in is thus

    −k logₑ n

The corresponding increase in free energy is

    kT logₑ n

But this is just equal to the least energy necessary to transmit the information as to which group of states contains the state the system is in, the information that led to the decrease in entropy and the increase in free energy.

We can regard any process which specifies something concerning which state a system is in as a message source. This source generates a message which reduces our uncertainty as to what state the system is in. Such a source has a certain communication-theory entropy per message. This entropy is equal to the number of binary digits necessary to transmit a message generated by the source. It takes a particular energy per binary digit to transmit the message against a noise corresponding to the temperature T of the system. The message reduces our uncertainty as to what state the system is in, thus reducing the entropy (of statistical mechanics) of the system. The reduction of entropy increases the free energy of the system. But, the increase in free energy is just equal to the minimum energy necessary to transmit the message which led to the increase of free energy, an energy proportional to the entropy of communication theory.

This, I believe, is the relation between the entropy of communication theory and that of statistical mechanics. One pays a price for information which leads to a reduction of the statistical-mechanical entropy of a system. This price is proportional to the communication-theory entropy of the message source which produces the information. It is always just high enough so that a perpetual motion machine of the second kind is impossible.

We should note, however, that a message source which generates messages concerning the state of a physical system is one very particular and peculiar kind of message source. Sources of English text or of speech sounds are much more common. It seems irrelevant to relate such entropies to the entropy of physics, except perhaps through the energy required to transmit a bit of information under highly idealized conditions.

There is one very odd implication of what we have just covered. Clearly, the energy needed to transmit information about the state of a physical system keeps us from ever knowing the past in complete detail. If we can never know the past fully, can we declare that the past is indeed unique? Or, is this a sensible question?

To summarize, in this chapter we have considered some of the problems of communicating electrically in our actual physical world. We have seen that various physical phenomena, including lightning and automobile ignition systems, produce electrical disturbances or noise which are mixed with the electrical signals we use for the transmission of messages. Such noise is a source of error in the transmission of signals, and it limits the rate at which we can transmit information when we use a particular signal power and band width.

The noise emitted by hot bodies (and any body is hot to a degree if its temperature is greater than absolute zero) is a particularly simple, universal, unavoidable noise which is important in all communication. But, at extremely high frequencies there are quantum effects as well as Johnson or thermal noise. We have seen how these affect communication in the limit of infinite bandwidth, but we have no quantum analog of expression 10.4.
The use of the term entropy in both physics and communication theory has raised the question of the relation of the two entropies. It can be shown in a simple case that the limitation imposed by thermal noise on the transmission of information results in the failure of a machine designed to convert the chaotic energy of heat into the organized energy of a lifted weight. Such a machine, if it succeeded, would violate the second law of thermodynamics. More generally, suppose we regard a source of information as to what state a system is in as a message source. The information-theory entropy of this source is a measure of the energy needed to transmit a message from the source in the presence of the thermal noise which is necessarily present in the system. The energy used in transmitting such a message is as great as the increase in free energy due to the reduction in physical entropy which the message brings about.

CHAPTER XI

Cybernetics

SOME WORDS HAVE a heady quality; they conjure up strong feelings of awe, mystery, or romance. Exotic used to be Dorothy Lamour in a sarong. Just what it connotes currently I don't know, but I am sure that its meaning, foreign, is pale by comparison. Palimpsest makes me think of lost volumes of Solomon's secrets or of other invaluable arcane lore, though I know that the word means nothing more than a manuscript erased to make room for later writing.

Sometimes the spell of a word or expression is untainted by any clear and stable meaning, and through all the period of its currency its magic remains secure from commonplace interpretations. Tao, élan vital, and id are, I think, examples of this. I don't believe that cybernetics is quite such a word, but it does have an elusive quality as well as a romantic aura.

The subtitle of Wiener's book, Cybernetics, is Control and Communication in the Animal and the Machine. Wiener derived the word from the Greek for steersman. Since the publication of Wiener's book in 1948, cybernetics has gained a wide currency. Further, if there is cybernetics, then someone must practice it, and cyberneticist has been anonymously coined to designate such a person.

What is cybernetics? If we are to judge from Wiener's book it includes at least information theory, with which we are now reasonably familiar; something that might be called smoothing, filtering, detection, and prediction theory, which deals with finding the presence of and predicting the future value of signals, usually in the presence of noise; and negative feedback and servomechanism theory, which Wiener traces back to an early treatise on the governor (the device that keeps the speed of a steam engine constant) published by James Clerk Maxwell in 1868. We must, I think, also include another field which may be described as automata and complicated machines. This includes the design and programming of digital computers.

Finally, we must include any phenomena of life which resemble anything in this list or which embody similar processes. This brings to mind at once certain behavioral and regulatory functions of the body, but Wiener goes much further. In his second autobiographical volume, I Am a Mathematician, he says that sociology and anthropology are primarily sciences of communication and therefore fall under the general head of cybernetics, and he includes, as a special branch of sociology, economics as well.

One could doubt Wiener's sincerity in all this only with difficulty. He had a grand view of the importance of a statistical approach to the whole world of life and thought. For him, a current which stems directly from the work of Maxwell, Boltzmann, and Gibbs swept through his own to form a broad philosophical sea in which we find even the ethics of Kierkegaard.

The trouble is that each of the many fields that Wiener drew into cybernetics has a considerable scope in itself. It would take many thousands of words to explain the history, content, and prospects of any one of them. Lumped together, they constitute not so much an exciting country as a diverse universe of overwhelming magnitude and importance.

Thus, few men of science regard themselves as cyberneticists. Should you set out to ask, one after another, each person listed in American Men of Science what his field is, I think that few would reply cybernetics. If you persisted and asked, "Do you work in the field of cybernetics?" a man concerned with communication, or with complicated automatic machines such as computers, or with some parts of experimental psychology or neurophysiology would look at you and speculate on your background and intentions. If he decided that you were a sincere and innocent outsider, who would in any event never get more than a vague idea of his work, he might well reply, "Yes."

So far, in this country the word cybernetics has been used most extensively in the press and in popular and semiliterary, if not semiliterate, magazines. I cannot compete with these in discussing the grander aspects of cybernetics. Perhaps Wiener has done that best himself in I Am a Mathematician. Even the more narrowly technical content of the fields ordinarily associated with the word cybernetics is so extensive that I certainly would never try to explain it all in one book, even a much larger book than this.

In this one chapter, however, I propose to try to give some small idea of the nature of the different technical matters which come to mind when cybernetics is mentioned. Such a brief résumé may perhaps help the reader in finding out whether or not he is interested in cybernetics and indicate to him what sort of information he should seek in order to learn more about it.

Let us start with the part of cybernetics that I have called smoothing, filtering, and prediction theory, which is an extremely important field in its own right. This is a highly mathematical subject, but I think that some important aspects of it can be made pretty clear by means of a practical example.
Suppose that we are faced with the problem of using radar data to point a gun so as to shoot down an airplane. The radar gives us a sequence of measurements of position, each of which is a little in error. From these measurements we must deduce the course and the velocity of the airplane, so that we can predict its position at some time in the future and, by shooting a shell to that position, shoot the plane down.

Suppose that the plane has a constant velocity and altitude. Then the radar data on its successive locations might be the crosses of Figure XI-1. We can by eye draw a line AB, which we would guess to represent the course of the plane pretty well. But how are we to tell a machine to do this?

If we tell a computing machine, or "computer," to use just the last and next-to-last pieces of radar data, represented by the points L and NL, it can only draw a line through these points, the dashed line A'B'. This is clearly in error. In some way, the computer must use earlier data as well. The simplest way for the computer to use the data would be to give an equal weight to all points.

Fig. XI-1

If it did this and fitted a straight line to all the data taken together, it might get a result such as that shown in Figure XI-2. Clearly, the airplane turned at point T, and the straight line AB that the computer computed has little to do with the path of the plane.

We can seek to remedy this by giving more importance to recent data than to older data. The simplest way to do this is by means of linear prediction. In making a linear prediction, the computer takes each piece of radar data (a number representing the distance north or the distance east from the radar, for instance) and multiplies it by another number. This other number depends on how recent the piece of data is; it will be a smaller number for an old piece of data than for a recent one. The computer then adds up all the products it has obtained and so produces a predicted piece of data (for instance, the distance north or east of the radar at some future time).
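In present-day terms, the weighted-sum scheme just described can be sketched in a few lines of Python. The exponentially decaying weights, the weighted straight-line fit, and the one-step extrapolation below are illustrative assumptions, not the rule used in any actual gun director; the point is only that the predicted position is a weighted combination of the past measurements, with recent measurements counting for more.

    # A minimal sketch of linear prediction from noisy position data.
    # Older measurements get smaller weights; the decay factor and the
    # weighted line fit are illustrative choices.

    def linear_predict(positions, decay=0.8, steps_ahead=1):
        n = len(positions)
        times = list(range(n))
        weights = [decay ** (n - 1 - t) for t in times]  # older data, smaller weight

        # Weighted least-squares fit of a straight line x = a + b*t.
        sw = sum(weights)
        st = sum(w * t for w, t in zip(weights, times))
        sx = sum(w * x for w, x in zip(weights, positions))
        stt = sum(w * t * t for w, t in zip(weights, times))
        stx = sum(w * t * x for w, t, x in zip(weights, times, positions))
        b = (sw * stx - st * sx) / (sw * stt - st * st)  # estimated velocity
        a = (sx - b * st) / sw                           # estimated intercept
        return a + b * (n - 1 + steps_ahead)             # extrapolated position

    # Noisy measurements of a target moving at a roughly constant speed:
    print(linear_predict([0.1, 0.9, 2.2, 2.8, 4.1]))     # roughly 5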

Fig. XI-2

The result of such prediction might be as shown in Figure XI-3. Here a linear method has been used to estimate a new position and direction each time a new piece of radar data, represented by a cross, becomes available. Until another piece of data becomes available, the predicted path is taken as a straight line proceeding from the estimated location in the estimated direction. We see that it takes a long time for the computer to take into account the fact that the plane has turned at the point T, despite the fact that we are sure of this by the time we have looked at the point next after T.

A linear prediction can make good use of old data, but, if it does this, it will be slow to respond to new data which is inconsistent with the old data, as the data obtained after an airplane turns will be. Or a linear prediction can be quick to take new data strongly into account, but in this case it will not use old data effectively, even when the old data is consistent with the new data.

Fig. XI-3

To predict well even when circumstances change (as when the airplane turns) we must use nonlinear prediction. Nonlinear prediction includes all methods of prediction in which we don't merely multiply each piece of data used by a number depending on how old the data is and then add the products.

As a very simple example of nonlinear prediction, suppose that we have two different linear predictors, one of which takes into account the last 100 pieces of data received, and the other of which takes into account only the last ten pieces of data received. Suppose that we use each predictor to estimate the next piece of data which will be received. Suppose that we compare this next piece of data with the output of each predictor. Suppose that we make use of predictions based on 100 past pieces of data only when, three times in a row, such predictions agree with each new piece of data better than predictions based on ten past pieces of data. Otherwise, we assume that the aircraft is maneuvering in such a way as to make long-past data useless, and we use predictions based on ten past pieces of data. This way of arriving at a final prediction is nonlinear because the prediction is not arrived at simply by multiplying each past piece of data by a number which depends only on how old the data is. Instead, the use we make of past data depends on the nature of the data received.

More generally, there are endless varieties of nonlinear prediction. In fact, nonlinear prediction, and other nonlinear processes as well, are the overwhelming total of all very diverse means after the simplest category, linear prediction and other linear processes, have been excluded. A great deal is known about linear prediction, but very little is known about nonlinear prediction.

This very special example of predicting the position of an airplane has been used merely to give a concrete sense of something which might well seem almost meaningless if it were stated in more abstract terms. We might, however, restate the broader problem, which has been introduced in a more general way.

Let us imagine a number of possible signals. These signals might consist of things as diverse as the possible paths of airplanes or the possible different words that a man may utter. Let us also imagine some sort of noise or distortion. Perhaps the radar data is inexact, or perhaps the man speaks in a noisy room. We are required to estimate some aspect of the correct signal: the present or future position of the airplane, the word the man just spoke, or the word that he will speak next. In making this judgment we have some statistical knowledge of the signal. This might concern what airplane paths are most likely, or how often turns are made, or how sharp they are. It might include what words are most common and how the likelihood of their occurrence depends on preceding words. Let us suppose that we also have similar statistics concerning noise and distortion.

We see that we are considering exactly the sort of data that are used in communication theory. However, given a source of data and a noisy channel, the communication theorist asks how he can best encode messages from the source for transmission over the channel. In prediction, given a set of signals distorted by noise, we ask, how do we best detect the true signal or estimate or predict some aspect of it, such as its value at some future time?
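The switching rule in the example above can also be made concrete with a short Python sketch. Each elementary predictor is linear (a straight-line fit to the last k points), and it is the rule for switching between them, based on which has recently agreed better with the incoming data, that makes the overall scheme nonlinear. The window sizes and the three-in-a-row test follow the text; everything else is an illustrative assumption.

    # A sketch of a nonlinear predictor built from two linear ones.

    def line_fit_predict(data, k):
        """Fit a straight line to the last k points and predict the next value."""
        pts = data[-k:]
        n = len(pts)
        if n < 2:
            return pts[-1]
        t_mean = (n - 1) / 2.0
        x_mean = sum(pts) / n
        num = sum((t - t_mean) * (x - x_mean) for t, x in enumerate(pts))
        den = sum((t - t_mean) ** 2 for t in range(n))
        slope = num / den
        return x_mean + slope * (n - t_mean)   # value one step past the window

    def switching_predict(history, long_k=100, short_k=10, wins_needed=3):
        """Use the long-memory predictor only after it has beaten the
        short-memory one three times in a row on the most recent data."""
        streak = 0
        for i in range(short_k, len(history)):
            past = history[:i]
            err_long = abs(line_fit_predict(past, long_k) - history[i])
            err_short = abs(line_fit_predict(past, short_k) - history[i])
            streak = streak + 1 if err_long < err_short else 0
        k = long_k if streak >= wins_needed else short_k
        return line_fit_predict(history, k)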
The armory of prediction consists of a general theory of linear prediction, worked out by Kolmogoroff and Wiener, and mathematical analyses of a number of special nonlinear predictors. I don't feel that I can proceed very profitably beyond this statement, but I can't resist giving an example of a theoretical result (due to David Slepian, a mathematician) which I find rather startling.

Let us consider the case of a faint signal which may or may not be present in a strong noise. We want to determine whether or not the signal is present. The noise and the signal might be voltages or sound pressures. We assume that the noise and the signal have been combined simply by adding them together. Suppose further that the signal and the noise are ergodic (see Chapter III) and that they are band limited, that is, they contain no frequencies outside of a specified frequency range. Suppose further that we know exactly the frequency spectrum of the noise, that is, what fraction of the noise power falls in every small range of frequencies. Suppose that the frequency spectrum of the signal is different from this.

Slepian has shown that if we could measure the over-all voltage (or sound pressure) of the signal plus noise exactly for every instant in any interval of time, however short the interval is, we could infallibly tell whether or not the signal was present along with the noise, no matter how faint the signal might be. This is a sound theoretical, not a useful practical, conclusion. However, it has been a terrible shock to a lot of people who had stated quite positively that, if the signal was weak enough (and they stated just how weak), it could not be detected by examining the signal plus noise for any particular finite interval of time.

Before leaving this general subject, I should explain why I described it in terms of filtering and smoothing as well as prediction and detection. If the noise mixed with a signal has a frequency spectrum different from that of the signal, we will help to separate the signal from the noise by using an electrical filter which cuts down on the frequencies which are strongly present in the noise with respect to the frequencies which are strongly present in the signal. If we use a filter which removes most or all high-frequency components (which vary rapidly with time), the output will not vary so abruptly with time as the input; we will have smoothed the combination of signal and noise.

So far, we have been talking about operations which we perform on a set of data in order to estimate a present or future signal or to detect a signal. This is, of course, for the purpose of doing something. We might, for instance, be flying an airplane in pursuit of an enemy plane. We might use a radar to see the enemy plane. Every time we take an observation, we might move the controls of the plane so as to head toward the enemy.

A device which acts continually on the basis of information to attain a specified goal in the face of changes is called a servomechanism. Here we have an important new element, for the radar data measures the position of the enemy plane with respect to our plane, and the radar data is used in determining how the position of our plane is to be changed. The radar data is fed back in such a way as to alter the nature of radar data which will be obtained later (because the data are used to alter the position of the plane from which new radar data are taken). The feedback is called negative feedback, because it is so used as to decrease rather than to increase any departure from a desired behavior.
We can easily think of other examples of negative feedback. The governor of a steam engine measures the speed of the engine. This measured value is used in opening or closing the throttle so as to keep the speed at a predetermined value. Thus, the result of the measurement of speed is fed back so as to change the speed. The thermostat on the wall measures the temperature of the room and turns the furnace off or on so as to maintain the temperature at a constant value. When we walk carrying a tray of water, we may be tempted to watch the water in the tray and try to tilt the tray so as to keep the water from spilling. This is often disastrous. The more we tilt the tray to avoid spilling the water, the more wildly the water may slosh about.

When we apply feedback so as to change a process on the basis of its observed state, the over-all situation may be unstable. That is, instead of reducing small deviations from the desired goal, the control we exert may make them larger. This is a particularly hazardous matter in feedback circuits. The thing we do to make corrections most complete and perfect is to make the feedback stronger. But this is the very thing that tends to make the system unstable. Of course, an unstable system is no good. An unstable system can result in such behavior as an airplane or missile veering wildly instead of following the target, the temperature of a room rising and falling rapidly, an engine racing or coming to a stop, or an amplifier producing a singing output of high amplitude when there is no input.

The stability of negative-feedback systems has been studied extensively, and a great deal is known about linear negative-feedback systems, in which the present amplitude is the sum of past amplitudes multiplied by numbers depending only on remoteness from the present. Linear negative-feedback systems are either stable or unstable, regardless of the input signal applied. Nonlinear feedback systems can be stable for some inputs but unstable for others. A shimmying car is an example of a nonlinear system. It can be perfectly stable at a given speed on a smooth road, and yet a single bump can start a shimmy which will persist indefinitely after the bump has been passed.

Oddly enough, most of the early theoretical work on negative-feedback systems was done in connection with a device which has not yet been described. This is the negative feedback amplifier, which was invented by Harold Black in 1927.

The gain of an amplifier is the ratio of the output voltage to the input voltage. In telephony and other electronic arts, it is important to have amplifiers which have a very nearly constant gain. However, vacuum tubes and transistors are imperfect devices. Their gain changes with time, and the gain can depend on the strength of the signal. The negative feedback amplifier reduces the effect of such changes in the gain of vacuum tubes or transistors.

We can see very easily why this is so by examining Figure XI-4. At the top we have an ordinary amplifier with a gain of ten times. If we put in 1 volt, as shown by the number to the left, we get out 10 volts, as shown by the number to the right. Suppose the gain of the amplifier is halved, so that the gain is only five times, as shown next to the top. The output also falls to one half, or 5 volts, in just the same ratio as the gain fell.

The third drawing from the top shows a feedback amplifier designed to give a gain of ten times. The upper box has a high gain of one hundred times.

Fig. XI-4

The output of that box is connected to a very accurate voltage-dividing box, which contains no tubes or transistors and does not change with time or signal level. The input to the upper box consists of the input voltage of 1 volt less the output of the lower box, which is 0.09 times the output voltage of 10 volts; this is, of course, 0.9 volt.

Now, suppose the tubes or transistors in the upper box change so that they give a gain of only fifty times instead of one hundred times; this is shown at the bottom of Figure XI-4. The numbers given in the figure are only approximate, but we see that when the gain of the upper box is cut in half the output voltage falls only about 10 per cent. If we had used a higher gain in the upper box the effect would have been even less.

The importance of negative feedback can scarcely be overestimated. Negative feedback amplifiers are essential in telephonic communication. The thermostat in your home is an example of negative feedback. Negative feedback is used to control chemical processing plants and to guide missiles toward airplanes. The automatic pilot of an aircraft uses negative feedback in keeping the plane on course.

In a somewhat broader sense, I use negative feedback from eye to hand in guiding my pen across the paper, and negative feedback from ear to tongue and lips in learning to speak or in imitating the voice of another. The animal organism makes use of negative feedback in many other ways. This is how it maintains its temperature despite changes in outside temperature, and how it maintains constant chemical properties of the blood and tissues. The ability of the body to maintain a narrow range of conditions despite environmental changes has been called homeostasis.

W. Ross Ashby, one of the few self-acknowledged cyberneticists, built a machine called a homeostat to demonstrate features of adjustment to environment which he believes to be characteristic of life. The homeostat is provided with a variety of feedback circuits and with two means for changing them. One is under the control of the homeostat; the other is under the control of a person who acts as the machine's "environment." If the machine's circuits are so altered by changes of its "environment" as to make it unstable, it readjusts its circuits by trial and error so as to attain stability again.
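The arithmetic of Figure XI-4 can be checked with a few lines of Python, assuming the usual closed-loop relation for such an amplifier: if the upper box has gain A and a fraction beta of the output is subtracted from the input, the over-all gain is A/(1 + A times beta).

    # Over-all gain of the feedback amplifier of Figure XI-4.
    def closed_loop_gain(forward_gain, feedback_fraction):
        return forward_gain / (1 + forward_gain * feedback_fraction)

    beta = 0.09                           # the accurate voltage-dividing box
    print(closed_loop_gain(100, beta))    # 10.0: the designed gain of ten
    print(closed_loop_gain(50, beta))     # about 9.1: a drop of roughly 10 per cent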

We may if we wish liken this behavior of the homeostat to that of a child who first learns to walk upright without falling and then learns to ride a bicycle without falling, or to many other adjustments we make in life. In his book Cybernetics, Wiener puts great emphasis on negative feedback as an element of nervous control and on its failure as an explanation of disabilities, such as tremors of the hand, which are ascribed to failures of a negative feedback system of the body.

We have so far discussed three constituents of cybernetics: information theory; detection and prediction, including smoothing and filtering; and negative feedback, including servomechanisms and negative feedback amplifiers. We usually also associate electronic computers and similar complex devices with cybernetics. The word automata is sometimes used to refer to such complicated machines.

One can find many precursors of today's complicated machines in the computers, automata, and other mechanisms of earlier centuries, but one would add little to his understanding of today's complex devices by studying these precursors. Human beings learn by doing and by thinking about what they have done. The opportunities for doing in the field of complicated machines have been enhanced immeasurably beyond those of previous centuries, and the stimulus to thought has been wonderful to behold.

Recent advances in complicated machines might well be traced to the invention of automatic telephone switching late in the last century. Early telephone switching systems were of a primitive, step-by-step form, in which a mechanism set up a new section of a link in a talking path as each digit was dialed. From this, switching systems have advanced to become common-control systems.

In a common-control switching system, the dialed number does not operate switches directly. It is first stored, or represented electrically or mechanically, in a particular portion of the switching system. Electrical apparatus in another portion of the switching system then examines different electrical circuits that could be used to connect the calling party to the number called, until it finds one that is not in use. This free circuit is then used to connect the calling party to the called party.

Modern telephone switching systems are of bewildering complexity and overwhelming size. Linked together to form a nationwide network which allows dialing calls clear across the country, they are by far the most complicated construction of man. It would take many words to explain how they perform even a few of their functions. Today, a few pulls of a telephone dial will cause telephone equipment to seek out the most economical available path to a distant telephone, detouring from city to city if direct paths are not available. The equipment will establish a connection, ring the party, time the call, and record the charge in suitable units, and it will disconnect the circuits when a party hangs up. It will also report malfunctioning of its parts to a central location, and it continues to operate despite the failure of a number of devices.

One important component of telephone switching systems is the electric relay. The principal elements of a relay are an electromagnet, a magnetic bar to which various movable contacts are attached, and fixed contacts which the movable contacts can touch, thus closing circuits.
When an electric current is passed through the coil of the electromagnet of the relay, the magnetic bar is attracted and moves. Some moving contacts move away from the corresponding fixed contacts, opening circuits; other moving contacts are brought into contact with the corresponding fixed contacts, closing circuits.

In the thirties, G. R. Stibitz of the Bell Laboratories applied the relays and other components of the telephone art to build a complex calculator, which could add, subtract, multiply, and divide complex numbers. During World War II, a number of more complicated relay computers were built for military purposes by the Bell Laboratories, while, in 1941, Howard Aiken and his co-workers built their first relay computer at Harvard. An essential step in increasing the speed of computers was taken shortly after the war when J. P. Eckert and J. W. Mauchly built the Eniac, a vacuum tube computer, and more recently transistors have been used in place of vacuum tubes.

Thus, it was an essential part of progress in the field of complex machines that it became possible to build them and that they were built, first by using relays and then by using vacuum tubes and transistors. The building of such complex devices, of course, involved more than the existence of the elements themselves; it involved their interconnection to do particular functions such as multiplication and division. Stibitz's and Shannon's application of Boolean algebra, a branch of mathematical logic, to the description and design of logic circuits has been exceedingly important in this connection.

Thus, the existence of suitable components and the art of interconnecting them to carry out particular functions provided, so to speak, the body of the complicated machine. The organization, the spirit, of the machine is equally essential, though it would scarcely have evolved in the absence of the body.

Stibitz's complex calculator was almost spiritless. The operator sent it pairs of complex numbers by teletype, and it cogitated and sent back the sum, difference, product, or quotient. By 1943, however, he had made a relay computer which received its instructions in sequence by means of a long paper tape, or program, which prescribed the numbers to be used and the sequences of operations to be performed.

A step forward was taken when it was made possible for the machine to refer back to an earlier part of the program tape on completing a part of its over-all task or to use subsidiary tapes to help it in its computations. In this case the computer had to make a decision that it had reached a certain point and then act on the basis of the decision. Suppose, for instance, that the computer was computing the value of the following series by adding up term after term:

    1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11 + ...

We might program the computer so that it would continue adding terms until it encountered a term which was less than 1/1,000,000 and then print out the result and go on to some other calculation. The computer could decide what to do next by subtracting the latest term computed from 1/1,000,000. If the answer was negative, it would compute another term and add it to the rest; if the answer was positive, it could print out the sum arrived at and refer to the program for further instructions.

The next big step in the history of computers is usually attributed to John von Neumann, who made extensive use of early computers in carrying out calculations concerning atomic bombs.
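The decision the relay computer had to make can be sketched in Python; the loop below keeps adding terms of the series until it reaches one smaller than 1/1,000,000, just as described. (The series happens to converge to pi/4; the stopping value is the one given in the text.)

    # Add terms of 1 - 1/3 + 1/5 - 1/7 + ... until a term is below 1/1,000,000.
    def sum_series(limit=1.0e-6):
        total = 0.0
        sign = 1.0
        denominator = 1
        while 1.0 / denominator >= limit:   # the computer's decision step
            total += sign / denominator
            sign = -sign
            denominator += 2
        return total

    print(sum_series())                     # about 0.785398, close to pi/4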
Even early computers had memories, or stores, in which the numbers used in intermediate steps of a computation were retained for further processing and in which answers were stored prior to printing them out. Von Neumann's idea was to put the instructions, or program, of the machine, not on a separate paper tape, but right into the machine's memory. This made the instructions easily and flexibly available to the machine and made it possible for the machine to modify parts of its instructions in accordance with the results of its computations.

In old-fashioned desk calculating machines, decimal digits were stored on wheels which could assume any of ten distinct positions of rotation. In today's desk and hand calculators, numbers are stored in binary form in some of the tens of thousands of solid-state circuit elements of a large-scale integrated circuit chip. To retain numbers in such storage, electric power (a minute amount) must be supplied continually to the integrated circuit. Such storage of data is sometimes called volatile (the data disappears when power is interrupted). But, volatile storage can be made very reliable by guarding against interruption of power.

Magnetic core memory, arrays of tiny rings of magnetic material with wires threaded through them, provides nonvolatile memory in computers larger than hand calculators and desk calculators. An interruption of power does not change the magnetization of the cores, but an electric transient due to faulty operation can change what is in memory.

Integrated circuit memory and core memory are random access memory. Any group of binary digits (a byte is 8 consecutive digits), or a word (of 8, 16, 32, or more digits) can be retrieved in a fraction of a microsecond by sending into the memory a sequence of binary digits constituting an address. Such memory is called random access storage.

In the early days, sequences of binary digits were stored as holes punched in paper tape. Now they are stored as sequences of magnetized patterns on magnetic tape or magnetic disks. Cassettes similar to or identical with audio cassettes provide cheap storage in small "recreational" computers. Storage on tape, disk, or cassette is sequential; one must run past a lot of bits or words to get to the one you want. Sequential storage is slow compared with random access storage. It is used to store large amounts of data, or to store programs or data that will be loaded into fast, random access storage repeatedly. Tape storage is also used as back-up storage, to preserve material stored in random access storage and disk storage that might be lost through computer malfunction.

Besides memory or storage, computers have arithmetic units which perform various arithmetical and logical operations, index registers to count operations, input and output devices (keyboards, displays, and printers), control units which give instructions to other units and receive data from them, and can have many specialized pieces of hardware to do specialized chores such as doing floating point arithmetic, performing Fourier analyses, or inverting matrices.

The user of a computer who wishes it to perform some useful or amusing task must write a program that tells the computer in explicit detail every elementary operation it must go through in attaining the desired result. Because a computer operates internally in terms of binary numbers, early programmers had to write out long sequences of binary instructions, laboriously and fallibly.
But, a computer can be used to translate sequences of letters and decimal digits into strings of binary digits, according to explicit rules. Moreover, subroutines can be written that will, in effect, cause the computer to carry out desirable operations (such as dividing, taking square root, and so on) that are not among the basic commands or functions provided by the hardware. Thus, assemblers or assembly language was developed. Assembly language is called machine language, though it is one step removed from the binary numbers that are stored in and direct the operations of the hardware.

When one writes a program in assembly language, he still has to write every step in sequence, though he can specify that various paths be taken through the program depending on the outcome of one step (whether a computed number is larger than, smaller than, or equal to another, for example). But, the numbers he writes are in decimal form, and the instructions are easily remembered sequences of letters such as CLA (clear add, a command to set the accumulator to zero and add to it the number in a specified memory location).

Instructing a computer in machine (assembly) language is a laborious task. Computers are practical because they have operating systems, by means of which a few simple instructions will cause data to be read in, to be read out, and other useful functions to be performed, and because programs (even operating systems themselves) are written in a higher-order language.

There are many higher-order languages. Fortran (formula translation) is one of the oldest and most persistent. It is adapted to numerical problems. A line of Fortran might read

40 IF SIN(X) < M THEN 80

The initial number 40 is the number of the program step. It instructs the computer to go to program step 80 if sin x is less than M. Basic is a widely used language similar to but somewhat simpler than Fortran. The C language is useful in writing operating systems as well as in numerical calculations. Languages seem to be easiest to use and most efficient (in computer running time) when they are adapted to particular uses, such as numerical calculations or text manipulation or running simulations of electrical or mechanical or economic systems.

One gets from a program in a higher-order language to one in machine-language instructions by means of a compiler, which translates a whole program, or by an interpreter, which translates the program line by line. Compilers are more efficient and more common. One writes programs by means of a text editor (or editing program) which allows one to correct mistakes, to revise the program without keying every symbol in afresh.

Today many children learn to program computers in grade school, more in high school, and more still in college. You can't get through Dartmouth and some other colleges without learning to use a computer, even if you are a humanities major.

More and more, computers are used for tasks other than scientific calculations or keeping books. Such tasks include the control of spacecraft and automobile engines, the operation of chemical plants and factories, keeping track of inventory, making airline and hotel reservations, making word counts and concordances, analyzing X-ray data to produce a three-dimensional image (CAT, computer-aided tomography), composing (bad) music, producing attractive musical sounds, producing engaging animated movies, playing games new and old, reading printed or keyed-in text aloud, and a host of other astonishing tasks.

Through large-scale integration, simple computers have become so cheap that they are incorporated into attractive toys.

While children learn a useful amount of programming with little difficulty, great talent and great skill are required in order to get the most out of a computer or a computerlike device, or to accomplish a given task with a smaller and cheaper computer. Today, far more money and human effort are spent in producing software (that is, programs of all sorts) than are spent on the hardware the programs run on. One talented programmer can accomplish what ten or a hundred merely competent programmers cannot do. But, there aren't enough talented programmers to go around. Indeed, programmers of medium competence are in short supply.

Even the most talented programmers haven't been able to make computers do some things. In principle, a computer can be programmed to do anything which the programmer understands in detail. Sometimes, of course, a computation may be too costly or too time-consuming to undertake. But often we don't really know in detail how to do something. Thus, things that a computer has not done so far include typing out connected speech, translating satisfactorily from one language to another, identifying an important and interesting mathematical theorem, or composing interesting music.

Efforts to use the computer to perform difficult tasks that are not well understood have greatly stimulated human thought concerning such problems as the recognition of human words, the structure of language, the strategy of winning games, the structure of music, and the processes of mathematical proof. Further, the programming of computers to solve complicated and unusual problems has given us a new and objective criterion of understanding. Today, if a man says that he understands how a human being behaves in a given situation or how to solve a certain mathematical or logical problem, it is fair to insist that he demonstrate his understanding by programming a computer to imitate the behavior or to accomplish the task in question. If he is unable to do this, his understanding is certainly incomplete, and it may be completely illusory.

Will computers be able to think? This is a meaningless question unless we say what we mean by to think. Marvin Minsky, a free-wheeling mathematician who is much interested in computers and complex machines, proposed the following fable. A man beats everyone else at chess. People say, "How clever, how intelligent, what a marvelous mind he has, what a superb thinker he is." The man is asked, "How do you play so that you beat everyone?" He says, "I have a set of rules which I use in arriving at my next move." People are indignant and say, "Why, that isn't thinking at all; it's just mechanical."

Minsky's conclusion is that people tend to regard as thinking only such things as they don't understand. I will go even further and say that people frequently regard as thinking almost any grammatical jumbling together of "important" words. At times I'd settle for a useful, problem-solving type of "thinking," even if it was mechanical. In any event, it seems likely that philosophers and humanists will manage to keep the definition of thinking perpetually applicable to human beings and a step ahead of anything a machine ever manages to do.
If this makes them happy, it doesn't offend me at all. I do think, however, that it is probably impossible to specify a meaningful and explicitly defined goal which a man can attain and a computer cannot, even including the "imitation game" proposed by A. M. Turing, a British logician, in 1950. In this game a man is in communication, say by teletype, with either a computer or a man, he doesn't know which. The man tries by means of questions to discover whether he is in touch with a man or a machine; the computer is programmed to deceive the man. Certainly, however, a computer programmed to play the imitation game with any chance of success is far beyond today's computers and today's art of programming, and it belongs to a very distant future, if to any.

We have seen that cybernetics is a very broad field indeed. It includes communication theory, to which we are devoting a whole book. It includes the complicated field of smoothing and prediction, which is so important in radar and in many other military applications. When we try to estimate the true position or the future position of an airplane on the basis of imperfect radar data, we are, according to Wiener, dealing with cybernetics. Even in using an electrical filter to separate noise of one frequency from signals of another frequency, we are invoking cybernetics. It is in this general field that the contribution of Wiener himself was greatest, and his great work was a general theory of prediction by means of linear devices, which makes a prediction merely by multiplying each piece of data by a number which is smaller the older the data is and adding the products.

Another part of cybernetics is negative feedback. A thermostat makes use of negative feedback when it measures the temperature of a room and starts or stops the furnace in order to make the temperature conform to a specified value. The autopilots of airplanes use negative feedback in manipulating the controls in order to keep the compass and altimeter readings at assigned values. Human beings use negative feedback in controlling the motions of their hands to achieve certain ends. Negative feedback devices can be unstable; the effect of the output can sometimes be to make the behavior diverge widely from the desired goal. Wiener attributes tremors and some other malfunctioning of the human being to improper functioning of negative feedback mechanisms. Negative feedback can also be used in order to make the large output signal of an amplifier conform closely in shape to the small input. Negative feedback amplifiers were extremely important in communication systems long before the day of cybernetics.

Finally, cybernetics has laid claim to the whole field of automata or complex machines, including telephone switching systems, which have been in existence for many years, and electronic computers, which have been with us only since World War II.

If all this is so, cybernetics includes most of the essence of modern technology, excluding the brute production and use of power. It includes our knowledge of the organization and function of man as well. Cybernetics almost becomes another word for all of the most intriguing problems of the world. As we have seen, Wiener includes sociological, philosophical, and ethical problems among these. Thus, even if a man acknowledged being a cyberneticist, that wouldn't give us much of a clue concerning his field of competence, unless he was a universal genius.
Certainly, it would not necessarily indicate that he had much knowledge of information theory. Happily, as I have noted, few scientists would acknowledge themselves as cyberneticists, save perhaps in talking to those whom they regard as hopelessly uninformed. Thus, if cybernetics is overextensive or vague, the overextension or vagueness will do no real harm. Indeed, cybernetics is a very useful word, for it can help to add a little glamor to a person, to a subject, or even to a book. I certainly hope that its presence here will add a little glamor to this one.

CHAPTER XII

Information Theory and Psychology

I HAVE READ a good deal more about information theory and psychology than I can or care to remember. Much of it was a mere association of new terms with old and vague ideas. Presumably the hope was that a stirring in of new terms would clarify the old ideas by a sort of sympathetic magic.

Some attempted applications of information theory in the field of experimental psychology have, however, been at least reasonably well informed. They have led to experiments which produced valid data. It is hard to draw any conclusions from these data that are both sweeping and certain, but the data do form a basis or at least an excuse for interesting speculations. In this chapter, I propose to discuss some experiments concerning information theory and psychology which are at least down-to-earth enough to grapple with. Naturally I have chosen these largely on the basis of my personal interest and background, but one has to impose some limitations in order to say anything coherent about a broad and less than pellucid field.

It seems to me that an early reaction of psychologists to information theory was that, as entropy is a wonderful and universal measure of amount of information and as human beings make use of information, in some way the difficulty of a task, perhaps the time a man takes to accomplish a set task, must be proportional to the amount of information involved.

This idea is very clearly illustrated in some experiments reported by Ray Hyman, an experimental psychologist, in the Journal of Experimental Psychology in 1953. Here I shall describe only one of several of the experiments Hyman made.

A number of lights were placed before a subject, as psychologists call an experimentee or laboratory human animal. Each light was labeled with a monosyllabic "name" with which the subject became familiar. After a warning signal, one of the several lights flashed, and the subject thereafter spoke the name of the light as soon as he could. The time interval between the flashing of the light and the speaking of the name was measured.

Sometimes one out of eight lights flashed, the light being chosen at random with equal probabilities. In this case, the information conveyed in enabling the subject to identify the light correctly was log 8, or 3 bits. Sometimes one among 7 lights flashed (2.81 bits), sometimes one among 6 (2.58 bits), sometimes one out of 5 (2.32 bits), one out of 4 (2.00 bits), one out of 3 (1.58 bits), one out of 2 (1 bit), or one out of 1 (0 bits). The average response time, or latency, between the lighting of the light and the speaking of its name was plotted against number of bits, as shown in Figure XII-1.

Clearly, there is a certain latency, or response time, even when only one light is used, the choice among lights is certain, and the information conveyed as to which light is lighted is zero. When more lights are used, the increase in latency is proportional to the information conveyed. Such an increase of latency with the logarithm of the number of alternatives had in fact been noted by a German psychologist, J. Merkel, in 1885. It is certainly a strikingly accurate, reproducible, and significant aspect of human response.

We note from Figure XII-1 that the increase in latency is about 0.15 second per bit. Some unwary psychologists have jumped to the conclusion that it takes 0.15 second for a human being to respond to 1 bit of information; therefore, the information capacity of a human being is about 1/.15, or about 7 bits per second. Have we discovered a universal constant of human perception or of human thought?
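The straight-line relation read off Figure XII-1 can be written as a tiny Python sketch: latency equals a constant plus about 0.15 second per bit of stimulus uncertainty, where the number of bits is the logarithm (base 2) of the number of equally likely alternatives. The intercept of 0.2 second below is an assumed round figure for illustration, not Hyman's measured value.

    from math import log2

    def predicted_latency(alternatives, intercept=0.2, slope=0.15):
        """Response time in seconds for a choice among equally likely lights."""
        return intercept + slope * log2(alternatives)

    for n in (1, 2, 4, 8):
        print(n, round(predicted_latency(n), 2))
    # 1 light: 0.2 s; 2 lights: 0.35 s; 4 lights: 0.5 s; 8 lights: 0.65 s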
Clearly, in Hyman's experiment the increase in latency is proportional to the uncertainty of the stimulus measured in bits. However, various experiments by various experimenters give somewhat different rates of increase in seconds per bit.

Fig. XII-1 (response time vs. bits per stimulus presentation)

Moreover, data published by G. H. Mowbray and M. V. Rhoades in 1959 show that, after much practice, a subject's performance tends to change so that there is little or no effect of information content on latency. It appears that human beings may have different ways of handling information, a way used in learning, in which number of alternatives is very important, and a way used after much learning, in which number of alternatives, up to a fairly large number, makes little difference. Further, in one sort of experiment, in which a subject depresses one or more keys on which his fingers rest in response to a vibration of the key, it appears that there may be little increase in latency with amount of information right from the start.

Moreover, even if the latency were a constant plus an increment proportional to information content, one could not reasonably assert that this showed that a significant information rate can be obtained by dividing the increased time by the number of bits. We will see that this can lead to fantastic information rates in the sort of experiment which I shall describe next.

H. Quastler made early information-rate experiments in which subjects played random sequences of notes or chords or read lists of randomly chosen words as rapidly as possible, and J. C. R. Licklider did early work on both reading and pointing speed. Before we heard of this work, J. E. Karlin and I embarked on an extensive series of experiments on reading lists of words, which of all experiments gives the highest observed information rate, a rate which is much higher than, for instance, sending Morse code or typing.

Suppose the "sender" of the message chooses an alphabet of, say, 16 words and makes up a list by choosing words among these randomly and with equal probabilities. Then, the amount of choice in designating each word is log 16, or 4 bits. The subject "transmits" the information, translating it into a new form, speech rather than print, by reading the list aloud. If he can read at a rate of 4 words a second, for instance, he transmits information at a rate of 4 x 4, or 16 bits per second.

Figure XII-2 shows data from three subjects. The words were chosen from the 500 most common words in English. We see that while the reading rate drops somewhat in going from 2- to 4-word vocabularies (or from 1 to 2 bits per word), it is almost constant for vocabularies or alphabets containing from 4 to 256 words (from 2 to 8 bits per word).

Let us now remember the alleged means for getting an information rate from such data as Hyman's, that is, noting the increase in time with increase in bits per stimulus. Consider the dotted average data curve of Figure XII-2. In going from 2 bits per stimulus to 8 bits per stimulus the reading rate doesn't decrease at all; that is, the change in reading time per word is 0, despite an increase of 6 in the number of bits per word. If we divide 6 by 0, we get an information rate of infinity! Of course, this is ridiculous, but it is scarcely more ridiculous than deducing an information rate from such data as Hyman's by dividing increase in number of bits by increase in latency.

Fig. XII-2 (words per second vs. bits per word)

Directly from Figure XII-2, we can see that as reader A reads 8-bit words at a rate of 3.8 per second, he manages to transmit information at a rate of 8 x 3.8, or about 30 bits per second. Moreover, when the words put in the list are chosen randomly from a 5,000-word dictionary (12.3 bits per word), he manages to read them at a rate of 2.7 per second, giving a higher information rate of 33 bits per second.

It is clear that no unique information rate can be used to describe the performance of a human being. He can transmit (and, we shall see, respond to or remember) information better under some circumstances than under others. We can best consider him as an information-handling channel or device having certain built-in limitations and properties. He is a very flexible device; he can handle information quite well in a variety of forms, but he handles it best if it is properly encoded, properly adjusted to his capabilities.

What are his capabilities? We see from Figure XII-2 that he is slowed down only a little by increasing complexity. He can read a list of words chosen randomly from an alphabet of 256 about as fast as words chosen from an alphabet of 4. He isn't very speedy compared with machines, and in order to make him perform well we must give him a complex task. This is just what we might have expected.

Complexity eventually does slow him down, however, as we see from the points for an alphabet consisting of all the words in a 5,000-word dictionary. Perhaps there is an optimum alphabet or vocabulary, which has quite a number of bits per word, but not so many words as to slow a man down unduly. Partly to help in finding such a vocabulary, Karlin and I measured reading rate as a function of both number of syllables and "familiarity," that is, whether the word came from the first thousand in order of commonness of occurrence or familiarity, from the tenth thousand, or from the nineteenth thousand. The results are shown in Figure XII-3. We see that while an increase in number of syllables slows down reading speed, a decrease in familiarity has just as pronounced an effect.
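The information-rate arithmetic used throughout these reading experiments is simple enough to set down in a short Python sketch: for equally probable words the number of bits per word is the logarithm (base 2) of the vocabulary size, and the rate is bits per word times words read per second. The reading speeds below are the figures quoted above; the 16-word case is the hypothetical example in the text.

    from math import log2

    experiments = [
        ("16-word alphabet", 16, 4.0),
        ("256-word alphabet", 256, 3.8),
        ("5,000-word dictionary", 5000, 2.7),
    ]

    for name, vocabulary, words_per_second in experiments:
        bits_per_word = log2(vocabulary)
        rate = bits_per_word * words_per_second
        print(f"{name}: {bits_per_word:.1f} bits/word, about {rate:.0f} bits/second")
    # 16 words: 4.0 bits/word, 16 bits/s;  256 words: 8.0, about 30;
    # 5,000 words: 12.3, about 33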

Fig. XII-3 (words per second vs. syllables per word for readers A, B, and C; familiarity groups 1-1,000, 10,000-11,000, and 19,000-20,000)

Thus, a vocabulary of familiar one-syllable words would seem to be a good choice. Using the 2,500 most common monosyllables (2,500 words means 11.3 bits per word) as a "preferred vocabulary," a reader attained a reading speed of 3.7 words per second, giving an information rate of 42 bits per second.

"Scrambled prose," that is, words chosen with the same probabilities as in nontechnical prose but picked at random without grammatical connection, also gave a high information rate. The entropy is about 11.8 bits per word, the highest reading rate was 3.7 words per second, and the corresponding information rate is 44 bits per second.

Perhaps one could gain a little by improving the alphabet, but I don't think one would gain much. At any rate, these experiments gave the highest information rate which has been demonstrated. It is a rate slow by the standards of electrical communication, but it does represent a tremendous number of binary choices, around 2,500 a minute!

What, we may ask, limits the rate? Is it reading through each word letter by letter? In this case the Chinese, who have a single sign for each word, might be better off. But Chinese who read both English and Chinese with facility read randomized lists of common Chinese characters and randomized lists of the equivalent English words at almost exactly the same speed.

Is the limitation a mechanical one? Figure XII-4 shows rates for several tasks. A man can repeat a memorized phrase over twice as fast as he can read randomized words from the preferred list, and he can read prose appreciably faster. It appears that the limitation on reading rate is mental rather than mechanical.

So far, it appears that we cannot characterize a human being by means of a particular information rate. While the difficulty of a task ultimately increases with its information content, the difficulty depends markedly on how well the task is tailored to human abilities. The human being is very flexible in ability, but he has to strain and slow down to do unusual things. And he is quite good at complexity but only fair at speed.

One way of tailoring a task to human abilities is by deliberate, thoughtful experiments. This is analogous to the process of so encoding messages from a message source as to attain the highest possible rate of information transmission over a noisy channel.

Fig. XII-4 (words per second for readers A, B, and C on several tasks)

This was discussed in Chapter VIII, and the highest attainable rate was called the channel capacity. The "preferred list" of the 2,500 most frequently used monosyllables was devised in a deliberate effort to attain a high information rate in reading aloud randomized lists of words.

We may note, however, that choosing words at random with the probabilities of their occurrence in English text gives as high or a little higher information rate. Have the words of the English language and their frequencies of occurrence been in some way fitted to human abilities by a long process of unconscious experiment and evolution?

We have seen in Chapter V that the probability of occurrence of a word in English text is very nearly inversely proportional to its rank. That is, the hundredth most common word occurs about a hundredth as frequently as the most common word, and so on. Figure V-2 illustrates this relation, which was first pointed out by George Kingsley Zipf, who ascribed it to a principle of least effort.

Clearly, Zipf's law cannot be entirely correct in this simple form. We saw in Chapter V that word probabilities cannot be inversely proportional to the rank of the word for all words; if they were, the sum of the probabilities of all words would be greater than unity. There have been various attempts to modify, derive, and explain Zipf's law, and we will discuss these somewhat later. However, we will at first regard Zipf's law in its original and simplest form as an approximate description of an aspect of human behavior in generating language, a description which Zipf arrived at empirically by examining the statistics of actual text.

Zipf, as we have noted, associated his law with a principle of least effort. Attempts have been made to identify the effort or "cost" of producing text with the number of letters in text. However, most linguists regard language primarily as the spoken language, and it seems unlikely that speaking, reading, or writing habits are dictated primarily by the numbers of letters used in words. In fact, we noted in the information-rate experiments which we just considered that reading rates are about the same for common Chinese ideographs and for the equivalent words in English written out alphabetically. Further, we have noted from Figure XII-3 that commonness or familiarity has an influence on reading time as great as does number of syllables.

Could we not, perhaps, take reading time as a measure of effort? We might think, for instance, that common words are more easily accessible to us, that they can be recognized or called forth with less effort or cost than uncommon words. Perhaps the human brain is so organized that a few words can be stored in it in such a fashion that they can be recognized and called forth easily and that many more can be stored in a fashion which makes their use less easy. We might believe that reading time is a measure of accessibility, ease of use, of cost.

We might imagine, further, that in using language, human beings choose words in such a way as to transmit as much information as possible for a given cost. If we identify cost with time of utterance, we would then say that human beings choose words in such a way as to convey as much information as possible in a given time of speaking or in a given time of reading aloud.
It is an easy mathematical task to show that if a speaking time t_r is associated with the rth word in order of commonness, then for a message composed of randomly chosen words the information rate will be greatest if the rth word is chosen with a probability p(r) given by

    p(r) = 2^{-ct_r}     (12.1)

Here c is a constant chosen to make the sum of the probabilities for all words add up to unity. This mathematical relation says that words with a long reading time will be used less frequently than words with a short reading time, and it gives the exact relation which must hold if the information rate is to be maximized.

Now, if Zipf's law holds, the probability of occurrence of the rth word in order of commonness must be given by

    p(r) = A/r     (12.2)

Here A is another constant. Thus, from 12.1 and 12.2 we must have

    A/r = 2^{-ct_r}     (12.3)

By using a relation given in the Appendix, this relation can be re-expressed

    t_r = a + b log r     (12.4)

Here a and b are constants which must be determined by examining the relation of the reading time t_r and the order of commonness or rank of a word, r. If Zipf's law is true and if the information rate is maximized for words chosen randomly and independently with probabilities given by Zipf's law, then relation 12.4 should hold for experimental data.

Of course, words aren't chosen randomly and independently in constructing English text, and hence we cannot say that word probabilities in accord with relation 12.1 actually would maximize information transmission per unit time. Nonetheless, it would be interesting to know whether predictions based on a random and independent choice of words do hold for the reading of actual English text.

Benoit Mandelbrot, a mathematician much interested in linguistic problems, has considered this matter in connection with reading-time data taken by D. H. Howes, an experimental psychologist. R. R. Riesz, an experienced experimenter in the field of psychophysics, and I have also attempted to compare equation 12.4 with human behavior.

There is a difficulty in making such a comparison. It seems fairly clear that reading speed is limited by word recognition, not by word utterance. A man may be uttering a long familiar word while he is recognizing a short, unfamiliar word. To get around this difficulty it seemed best to do some averaging by measuring the total time of utterance for three successive words and then comparing this with the sum of the times for the words computed by means of 12.4. Riesz ingeniously and effectively did this and obtained the data of Figure XII-5. In the test, a subject read a paragraph as fast as possible. Certainly, a straight line according to 12.4 fits the data as well as any curve would. But the points are too scattered to prove that 12.4 really holds.

Moreover, we should expect such a scatter, for the rank r corresponds to commonness of occurrence in prose from a variety of sources, but we have used it as indicating the subject's experience with and familiarity with the word. Also, as we see from Figure XII-3, word length may be expected to have some effect on reading time. Finally, we have disregarded relations among successive words.

This sort of experiment is extremely exasperating. One can see other experiments which he might do, but they would be time consuming, and there seems little chance that they would establish anything of general significance in a clear-cut way. Perhaps a genius will unravel the situation some day, but the wary psychologist is more apt to seek a field in which his work promises a definite, unequivocal outcome.
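The step from relation 12.3 to relation 12.4, which the text refers to the Appendix, is only a matter of taking logarithms of both sides; a brief sketch of the algebra:

    log (A/r) = log 2^{-ct_r}
    log A - log r = -c t_r log 2
    t_r = (log r - log A) / (c log 2)

which has the form t_r = a + b log r, with a = -(log A)/(c log 2) and b = 1/(c log 2). If the logarithms are taken to the base 2, the factor log 2 is simply 1.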

Fig. XII-5 (reading time in seconds)

The foregoing work does at least suggest that word usage may be governed by economy of effort and that economy of effort may be measured as economy of time. We still wonder, however, whether this is the outcome of a trained ability to cope with the English language or whether language somehow becomes adapted to the mental abilities of people. What about the number of words we use, for instance?

People sometimes measure the vocabulary of a writer by the total number of different words in his works and the vocabulary of an individual by the number of different words he understands. However, rare and unusual words make up a small fraction of spoken or written English. What about the words that constitute most of language? How numerous are these?

One might assert that the number of words used should reflect the complexity of life and that we would need more in Manhattan than in Thule (before the Air Force, of course). But, we always have the choice of using either different words or combinations of common words to designate particular things. Thus, I can say either "the blonde girl," "the redheaded girl," "the brunette girl" or I can say "the girl with light hair," "the girl with red hair," "the girl with dark hair." In the latter case, the words with, light, dark, red, and hair serve many other purposes, while blonde, redheaded, and brunette are specialized by contrast.

Thus, we could construct an artificial language with either fewer or more common words than English has, and we could use it to say the same things that we say in English. In fact, we can if we wish regard the English alphabet of twenty-six letters as a reduced language into which we can translate any English utterance.

Perhaps, however, all languages tend to assume a basic size of vocabulary which is dictated by the capabilities and organization of the human brain rather than by the seeming complexity of the environment. To this basic language, clever and adaptable people can, of course, add as many special and infrequently used words as they desire to or can remember.

Zipf has studied just this matter by means of the graphs illustrating his law. Figure XII-6¹ shows frequency (number of times a word is used) plotted against rank (order of commonness) for 260,430 running words of James Joyce's Ulysses (curve A) and for 43,989 running words from newspapers (curve B).

¹ Reproduced from George Kingsley Zipf, Human Behavior and the Principle of Least Effort, Addison-Wesley Publishing Company, Reading, Mass., 1949.

[Fig. XII-6: frequency plotted against rank, both on logarithmic scales running from 1 to 10,000.]

word is used) plotted against rank (order of commonness) for 260,430 running words of James Joyce's Ulysses (curve A) and for 43,989 running words from newspapers (curve B). The straight line C illustrates Zipf's idealized curve or "law." Clearly, the heights of A and B are determined merely by the number of words in the sample; the slope of the curve and its constancy with length of sample are the important things. The steps at the lower right of the curves, of course, reflect the fact that infrequent words can occur once, twice, thrice, and so on in the sample but not 1.5 or 2.67 times.

When we idealize such curves to a 45° line, as in curve C, we note that more is involved than the mere slope of the line. We start our frequency measurement with words which occur only once; that is, the lower right hand corner of the graph represents a frequency of occurrence of 1. Similarly, the rank scale starts with 1, the rank assigned to the most frequently used word. Thus, vertical and horizontal scales start at 1, and equal distances along them were chosen in the first place to represent equal increases in number.

We see that the 45° Zipf-law line tells us that the number of different words in the sample must equal the number of occurrences of the most frequently used word. We can go further and say that if Zipf's law holds in this strict and primitive form, a number of words equal to the square root of the number of different words in the passage will make up half of all the words in the sample. In Figure XII-7 the number N of different words and the number V of words constituting half the passage are plotted against the number L of words in the passage.

[Fig. XII-7: the number N of different words and the number V of words constituting half the passage, plotted against the length L of the passage in thousands of words, on logarithmic scales.]

Here is vocabulary limitation with a vengeance. In the Joyce passage, about 170 words constitute half the text. And Figure XII-6 assures us that the same thing holds for newspaper writing!

Zipf gives curves which indicate that his law holds well for Gothic, if one counts as a word anything spaced as a word. It holds fairly well for Yiddish and for a number of Old German and Middle High German authors, though some irregularities occur at the upper left part of the curve. Curves for Norwegian tend to be steeper at the lower right than at the upper left, and Plains Cree gives a line having only about three-fourths the slope of the Zipf's law line. This means a greater number of different words in a given length of text, that is, a larger vocabulary. Chinese characters give a curve which zooms up at the left, indicating a smaller vocabulary. Nonetheless, the remarkable thing is the similarity exhibited by languages.

The implication is that the variety and probability distribution of words is pretty much the same for many, if not all, written languages. Perhaps languages do necessarily adapt themselves to a pattern dictated by the human mental abilities, by the structure and organization of the human brain. Perhaps everyone notices and speaks about roughly the same number of features in his environment. An Eskimo in the bleak land of the north manages this by distinguishing by word and in thought among many types of snow; an Arab in the desert has a host of words concerning camels and their equipage. And perhaps all of these languages adapt themselves in such a way as to minimize the effort involved in human communication. Of course, we don't really know whether or not these things are so.

Zipf's data have been criticized. I find it impossible to believe that the number of different words is dictated entirely by length of sample, regardless of author. Certainly the frequency with which the word the occurs cannot change with the length of sample, as Zipf's law in its simple form implies. It is said that Zipf's law holds best for sample sizes of around 120,000 words, that for smaller samples one finds too many words that occur only once, and that for larger samples too few words occur only once. It seems most reasonable to assume that only the multiple authorship gave the newspapers the same vocabulary as Joyce.

So far, our approach to Zipf's law has been that of taking it as an approximate description of experimental data and asking where this leads us. There is another approach to Zipf's law. One can attempt to show that it must be so on the basis of simple assumptions concerning the generation of text. While various workers have given proposed derivations, showing that Zipf's law follows from certain assumptions, Benoit Mandelbrot, a mathematician who was mentioned earlier, did the first satisfactory work and appears to have carried such work furthest.

Mandelbrot gives two derivations. In the first, he assumes that text is produced as a sequence of letters and spaces chosen randomly but with unequal probabilities, as in the first-order approximation to English text of Chapter III. This allows an infinite number of different "words" composed of sequences of letters separated from other sequences by spaces.
On the basis of this assumption only, Mandelbrot shows that the probability of occurrence of the rth of these "words" in order of commonness must be given by

p(r) = P(r + V)^(-B)                        (12.5)

The constants B and V can be computed if the probabilities of the various letters and of the space are known. B must be greater than 1. P must be such as to make the sum of p(r) over all "words" equal to unity.

We see that if V were very small and B were very nearly equal to 1, 12.5 would be practically the same as Zipf's original law. Instead of the straight, 45° line of Figure XII-6, equation 12.5 gives a curve which is less steep at the upper left and steeper at the lower right. Such a curve fits data on much actual text better than Zipf's original law does.

It has been asserted, however, that the lengths of the "words" produced by the random process described don't correspond to the lengths of words as found in typical English text. Further, language certainly has nonrandom features. Words get shortened as their usage becomes more common. Thus, taxi and cab came from taxicab, and cab in turn came from cabriolet. Can we say that the fact that the random production of letters leads to the production of "words" which obey Zipf's law explains Zipf's law? It seems to me that we can assert this only if we can show how the forces which do shape language imitate this random process.
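The first derivation is easy to try numerically. The Python sketch below produces "words" by choosing letters and a space at random with unequal probabilities, ranks the resulting "words" by how often they occur, and prints the rank-frequency relation, which falls roughly along a curve of the form of equation 12.5. The letter probabilities are invented for illustration; Mandelbrot's constants P, V, and B would have to be computed from whatever probabilities are actually used.

    import random
    from collections import Counter
    from math import log2

    random.seed(1)

    # Invented symbol probabilities: four letters and a space.
    symbols = ["a", "b", "c", "d", " "]
    weights = [0.30, 0.25, 0.15, 0.10, 0.20]

    # Generate a long stream of symbols and split it into "words" at the spaces.
    stream = "".join(random.choices(symbols, weights=weights, k=200_000))
    words = [w for w in stream.split(" ") if w]

    counts = Counter(words)
    total = len(words)

    # On log-log scales these points lie near a line of slope -B,
    # bending at small ranks as equation 12.5 predicts.
    for rank, (word, n) in enumerate(counts.most_common(), start=1):
        if rank in (1, 2, 5, 10, 20, 50, 100, 200, 500):
            p = n / total
            print(f"rank {rank:4d}  p(r) = {p:.5f}  log2 p = {log2(p):6.2f}")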

In his second derivation of a modified form of Zipf's law as a consequence of certain initial assumptions, Mandelbrot assumes that word frequencies are such as to maximize the information for a given cost. As a simple case, he assumes that each letter has a particular cost and that the cost of each word (that is, of each sequence of letters ending in a space) is the sum of the costs of its letters. This leads him to the same expression as the other derivation, that is, to equation 12.5. The interpretation of the different symbols is different, however. The constant B can be less than unity if the total number of allowable words is finite.

Regardless of the meaning of the constants P, V, and B in expression 12.5, we can, if we wish, merely give them such values as will make the curve defined by 12.5 best fit statistical data derived from actual text. Certainly, we can fit actual data better this way than we can if we assume that V = 0 and B = 1 (corresponding to Zipf's original law). In fact, by so choosing the values of P, V and B, equation 12.5 can be made to fit available data very well in all but a few exceptional cases. In the cases of modern Hebrew of around 1930 and Pennsylvania Dutch, which is a mixture of languages, a value of B smaller than 1 gives the best fit.

According to Mandelbrot, the wealth of vocabulary is measured chiefly by the value of B; if B is much greater than 1, a few words are used over and over again; if B is nearer to 1, a greater variety of words is used. Mandelbrot observes that as a child grows, B decreases from values around 1.6 to values around 1.15, or to a value around 1 if the child happens to be James Joyce.

Certainly, equation 12.5 fits data better than Zipf's original law does. It overcomes the objection that, according to Zipf's original law, the probability of the word the should depend on the length of the sample of text. This does not mean, however, that Mandelbrot's explanation or derivation of equation 12.5 is necessarily correct. Further, it is possible that some other mathematical expression would fit data concerning actual text even better. A much more thorough study would be necessary to settle such questions.

Zipf's law holds for many other data than those concerning word usage. For instance, in most countries it holds for the populations of cities plotted against rank in size. Thus, the tenth largest city has about a tenth the population of the largest city, and so on. However, the fact that the law holds in different cases may be fortuitous. The inverse-square law holds for gravitational attraction and also for intensity of light at different distances from the sun, yet these two instances of the law cannot be derived from any common theory.

It is clear that our ability to receive and handle information is influenced by inherent limitations of the human nervous system. George A. Miller's law of 7 plus-or-minus 2 is an example. This states that after a short period of observation, a person can remember and repeat the names of from 5 to 9 simple, familiar objects, such as binary or decimal digits, letters, or familiar words.

By means of a tachistoscope, a brightly illuminated picture can be shown to a human subject for a very short time. If he is shown a number of black beans, he can give the number correctly up to perhaps as many as 9 beans. Thus, one flash can convey a number 0 through 9, or 10 possibilities in all. The information conveyed is log 10, or 3.3 bits.
If ihe subjectis shown a sequenceof bin ary digits,he can recall correctly perhaps as many as 7, so that 7 bits of information are conveyed. If the subjectis shown letters,he can rememberperhaps 4 or 5, so that the information is as much as 5 log 26 bits, or 23 bits. The subjectcan rememberperhaps 3 or 4 short, commonwords, somewhaifewer thanT z.If these arechosen from the 500most common words, the information is 3 log 500, ot 27 bits. As in the caseof the reading rate experiments,the gain due to greater complexity outweighsthe loss due to fewer items, and the information increaseswith increasingcomplexity. Now, both Miller's 7 plus-or-minus-2rule and the readingra_te experimentshave embarrassingimplications. If a man gets only27 bits of information from a piclure, can we transmit by meansof 27 brts of information a picture which, when flashedon a screen' will satisfactorily imitate any picture? If a man can transmit only about 40 bits of information per second,&S the readingrate experi- ments indicate, can we transmit TV or voice of satisfactoryquality using only 40 bits per second? In eachcase t Uelievethe answerto be no. What is wrong?What is wrong is that we have measuredwhat gets out of the human Information Theory and Psycholog 249 being,not what goesin. Perhapsa human being can in somesense only notice 40 bits a secondworth of information, but he has a choiceas to what he notices.He might, for instance,notice the girl or he might notice the dress.Perhaps he noticesmore, but it gets away from him before he can describeit. Two psychologists,E. Averback and G. Sperling,studied this problem in similar manners.Each projected a large number (16 or 18) of letters tachistoscopically.A fraction of a secondlater they gavethe subjecta signalby meansof a pointer or tone which indicatedwhich of the lettershe should report. If he could unfail- ingly report any indicated letter, all the letters must have "gotten in," sincethe letter which was indicated was chosenrandomly. The resultsof theseexperiments seem to showthat far more than 7 plus-or-minus-2items areseen and storedbriefly in the organisffi, for a few tenths of a second.It appearsthat 7 plus-or-minus-2of theseitems can be transferredto a more permanent memory at a rate of about one item eachhundredth of a second,or lessthan a tenth of a secondfor all items. This other memory can retain the transferreditems for severalseconds. It appearsthat it is the size limitation of this longer-term memory which gives us the 7 plus- or-minus-2figure of Miller. Human behavior and human thought are fascinating, and one could go on and on in seekingrelations between information theory and psychology.I have discussedonly a few selectedaspects of a broad field. One can still ask,however, is information theory really highly important in psychology,or doesit merely give us another way of organizins data that might as well have been handled in some other manner?I myself think that information theory has provided psychologistswith a new and important picture of the processof communicationand with a new and important measure of the complexity of a task. It has also been important in stirring psychologistsup, in making them re-evaluateold data and seek new data. It seemsto ffie, however,that while information theory provides a central, universal structure and organrzattonfor elec- trical communication, it constitutes only an attractive area in psychology.It also adds a few new and sparklirg expressionsto the vocabularyof workers in other areas. 
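Before leaving these experiments, it is worth noting that the information figures quoted above are easy to recompute. The short sketch below redoes the tachistoscope estimates with logarithms to the base 2; the item counts are simply the ones given in the text.

    from math import log2

    # Information per flash for each of the tachistoscope examples quoted above.
    examples = [
        ("up to 9 beans (a number 0 through 9)", 1, 10),   # 1 item, 10 possibilities
        ("about 7 binary digits",                7, 2),
        ("4 or 5 letters",                       5, 26),
        ("3 or 4 common words (500-word list)",  3, 500),
    ]

    for name, items, alternatives in examples:
        bits = items * log2(alternatives)
        print(f"{name:40s} {bits:5.1f} bits")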
CHAPTER XIII
Information Theory and Art

Soun yEARs AGo when a competent modern comPoser and professor of music visited the Bell Laboratories, he was full of the r.*r that musical sounds and, in fact, whole musical compositions can be reduced to a seriesof numbers. This was of course old stuff to us. By using pulse code modulation, one can represent any electric or u.o.6tic *aue form by means of a sequence of sample amplitudes. We had considered something that the composer didn't appreci- ate. In order to represent fairly high-quality music, with a band width of 15,000 cycles per second, one must use 30,000 samples per second, and each one of these must be specified to an accuracy bf p.thaps one part in a thousand. We can do this by using three decimat bigits (or about ten binary digits) to designate the ampli- tude of each sample. A composer could exercise complete freedom of choice among sounds simply by specifyirg a sequence of 30,000 three-digit deci- mal numbers a second. This would allow him to choose from among a number of twenty-minute compositions which can be written as I followed by 108 million O's-an inconceivably large number. Putting it another way, the choice he could exercisein composing would be 300,000 bits per second. Here *. r.rse what is wrong. We have noted that by the fastest demonstrated means, that is, by reading lists of words as rapidly as possible, a human being demonstrates an information rate of 250 Information Theory and Art 251 no more than 40 bits per second.This is scarcelymore than a ten- thousandthof the rate we haveallowed our composer. Further, it may be that a human being can make use of, can apprecrate, information only at somerate even lessthan 40 bits per second.When we listen to an actor, we hear highly redundant Englishuttered at a rather moderatespeed. The flexibility and freedomthat a composerhas in expressinga compositionas a sequenceof sampleamplitudes is largelywasted. 'ocompositions" They allow him to producea host of which to any human auditor will sound indistinguishableand uninteresting. Mathematically,white Gaussiannoise, which containsall frequen- ciesequally, is the epitomeof the variousand unexpected.It is the least predictable,the most original of sounds. To a human being,however, all white Gaussiannoise sounds alike. Its subtleties are hidden from him, and he saysthat it is dull and monotonous. If a human beingfinds monotonous that which is mathematically most various and unpredictable,what does he find fresh and interesting?To be able to call a thing new, he must be able to distinguishit from that which is old. To be distinguishable,sounds must be to a degreefamiliar. We can tell our friendsapart, we can appreciatetheir particular individual qualities,but we find much less that is distinctive in .strangers.We can,of course,tell a Chinesefrom our Caucasian friends,but this doesnot enableus to enjoyvariety amongChinese. To do that we haveto learn to know and distinguishamong many Chinese.In the sameway, we can distinguishGaussian noise from Romantic music,but this givesus little scopefor variety, because all Gaussiannoise sounds alike to us. Indeed, to many who love and distinguish among Romantic composers,most eighteenth-centurymusic sounds pretty much alike. And to them Grieg's Hotberg Suite may sound like eight- eenth-centurymusic, which it resemblesonly superficially.Even to thosefamiliar with eighteenth-centurymusic, the choral music of the sixteenthcentury may seemmonotonous and undistinguish- able. 
I know, too, that this works in reverseorder, for some partisansof Mozart find Verdi monotonous,and to thosefor whom Verdi affords tremendousvariety much modern music soundslike undistinguishablenoise. Of coursea composerwants to be free and original, but he also 252 Symbols, Signals and Noise wants to be known and appreciated.If his audiencecan't tell one of his compositionsfrom another, they certaitly won't buy record- ings of many different compositions.If they can't tell his composi- tions from those of a whole school of composers,they may be satisfiedto let one recorditg stand for the lot. How, then, can a composermake his compositionsdistinctive to an audience?Only by keeping their entropy, their information rate, their variety within the bounds of human ability to make distinctions.Orly when he doles his variety out at a tate of a very few bits per secondcan he expect an audienceto recogntzeand appreciateit. Does this mean that the calculating composer,the information- theoretic composerso to speak, will produce a simple and slow successionof randomly chosennotes? Of coursenot, not anymore than a writer producesa random sequenceof letters. Rather,the composer will make up his composition of larger units which are already familiar in some degree to listenersthrough the training they have receivedin listening to other compositions.These units will be orderedso that, to a degree, a listenerexpects what comes next and isn't continually thrown off the track. Perhapsthe com- poser will surprise the listener a bit from time to time, but he won't try to do this continually. To a degree,too, the composer will introduce entirely new material sparingly. He will familianze the listener with this new material and then repeat the materialin somewhat alteredforms. To use the analogy of language, the composer will write in a langu agewhichthe listener knows. He will producea well-ordered sequenceof musical words in a musically grammatical order.The words may be recognLzable chords, scales,themes, or ornaments. They will succeedone another in the equivalentsof sentencesor stanzas,usually with a good deal of repetition.They will be uttered by he familiar voicesof the orchestra.If he is a good composer, he will in someway convey a distinct and personalimpression to the skilled listener.If he is at least a skillful composer,his composi- tion will be intelligible and agreeable. Of course, none of this is new. Those quite unfamiliar with information theory could have said it, and they have saidit in other words. It does seemto me, however, that thesefacts are particu- Information Theoryand Art 253 larly pertinent to a duy in which composers,and other artistsas well, are faced with a multitude of technical resourceswhich are temptirg, exasperating,and a little frightening. Their first temptationis certainly to choosetoo freely and too widely. M. V. Mathewsof the Bell Laboratorieswas intrigued by the fact that an electroniccomputer can create any desiredwave form in responseto a sequenceof commandspunched into cards. He deviseda programsuch that he could specifyone note by each card as to wave form, duratior, pitch, and loudness.Delighted with the freedomthis affordedhim, he had the computerreproduce rapid rhythmic passagesof almost unplayablecombinations, such as three notes againstfour with unusual patternsof accent.These ingeniousexercises sounded, simply, chaotic. 
Very skillful composers,such as Vardse, can evokean impression of form and senseby patching togetherall sorts of recordedand modified sounds after the fashion of musique concrite. Many appealingcompositions utilizing electronicallygenerated sounds have already been produced. Still, the composer is faced with difficultieswhen he abandonstraditional resources. The composer can choose to make his compositions much simpler than he would if he were writing more conventionally,in ordef not to losehis audience.Or he and otherscan try to educate an audienceto rememberand distinguishamong the new resources of which they avail themselves.Or the composercan chooseto remain unintelligibleand await vindication from posterity.Perhaps there areother alternatives; cefiainly there areif the composerhas real genius. Does information theory have anything concreteto offer con- cerning the arts?I think that it has very little of seriousvalue to offer except a point of view, but I believethat the point of view may be worth explorirg in the brief remainderof this chapter. In ChaptersIII, VI, and XII we consideredlanguage. Language consistsof an alphabetor vocabulary of words and of grammatical rules or constraintsconcerning the use of words in grammatical text. We learned to distinguishbetween the featuresof text which are dictated by the vocabulary and the rules of grammar and the actual choice exercisedby the writer or speaker.It is only this element of choice which contributes to the average amount of 254 Symbols,Signals and Noise

rnformation per word. We saw that Shannonhas estimatedthis to be between3.3 and 7.2 bits per word. It must also be this choice which enablesa writer or speakerto conveymeaning, whatever that may be. The vocabularyof a languageis large,although we haveseen in ChapterXII that a comparatlely few words make up the bulk of any text. The rules of grammar are so complicatedthat they have not beencompletely formulated. Nonetheless, most peopl-e have a latge vocabulary, and they know the rules of gra--ui irt the sensethat they can recognizeand write grammatiial English. It is reasonableto assumea similarly surprisinglylarge kriowl- edgeof musicalelements and of relationsamong tfrb* on thepart of the person who listens to music frequentli attentively,and appreciatively.Of course,it is not necessary that the listenerbe able to formulate his knowledgefor him to haveit, anymore than the writer of grammatical English need be able to formulatethe rules of English grammar. He need not even be able to write music accorditg to the rules, any more than a mute who under- standsspeech can speak. He can still in somesens e knowthe rules and make useof his knowledgein listening to music. Such a knowledgeof the elementsandlules of the musicof a particular nation, era, or school is what I 'oknowitg have referred to as the languageof music" or of a style of music.However much the rulesof musicmay or may not be basedon physicallaws, a knowledgeof a languageof music must be acquired by yearsof practice,just as the knowledgeof a spokenlanguage is. li is only by meansof such a knowledgethat we can diitinguish the style and ihdividuality of a composition, whether lite riry or musiCal. To the untutored ear,the soundsof music will seemto be examples chosennot from a restrictedclass of learnedsounds but from all the infinity of possiblesounds. To the untutored ear,the mechani- cal rvorkings of the rules of music will seemto representchoice and variety. Thus, the apparent complexity of niusic will over- whelm the untutored auditor or the auditor familiar only with some other languageof music. We shouldnote that we can write sensewhile violating the rules of grammar to a degree(me heap big injun). We mighiliken the intelligibility of this sentenceto an English-speakingprtton to our Information Theory and Art 255 ability to appreciatemusic which is somewhat strangebut not entirelyforeign to our experience.We shouldalso note that we can write nonsensewhile obeyingthe rules of grammar carefully(the alabasterword spokesilently to the purple). It is to this second possibility to which I wish to addressmyself in a moment. I will remark first, however, that while one can of course both write senseand obey the rules while doing so, he often exposeshis inadequaciesto the public gazeby thus being intelligible. It is no news that we can dispensewith sensealmost entirely while retaininga conventionalvocabulary and someor many rules. Thus, Mozart provided posterity with a collection of assort€d, numberedbars in Tatime, togetherwith a set of rules (Koechel 294D).By throwirg dice to obtain a sequenceof random numbers and choosingsuccessive bars by means of the rules, even the nonmusicalamateur can "compose"an almost endlessnumber of little waltzeswhich soundlike somewhatdisorganized Mozart. An exampleis shown in Figure XIII-I. Joseph Haydn, Maximilian Stadler,and Karl Philipp EmanuelBach are said to haveproduced similar random music.In more recenttimes, John Cagehas used random processesin the choiceof sequencesof notes. 
In ignoranceof theseillustrious predecessors,in 1949M. E. Shannon('swife) and I undertook the composi- tion of somevery primitive statisticalor stochasticmusic. First we made a catalogof allowedchords on roots I-VI in the k.y of C. Actually, the catalogcovered only root I chords; the otherswere


Fig. XIII-I 256 Symbols, Signals and Noise

derived from theseby rules.By throwirg three speciallymade dice and by using a table of random numbers,a number of ro*posi- tions were produced. In these compositions,the only rule of chord connectionwas that two succeedingchords have a common tonein the samevoice. This let the other voicesjt*p around in a wild and ratherunsatis- factory manner. It would correspondto the use of a simple and consistentbut incorrect digram probability in the construition of synthetictext, as illustrated in chapter III. While the short-rangestructure of thesecompositions was very primitive, an effort wasmade to give them a plausibleand reason- ably memorable,longer-range structure. Thus, eachcomposition consistedof eight measuresof four quarter notes each.The long- range structure was attainedby making measures5 and 6 repeit measuresI and 2, while measures3 and 4 differed from measures 7 and 8. Thus, the compositionswere primitive rondos.Further, it was specifiedthat chords l, 16,and 32 haveroot I and chords 15 and 3l have either root IV or root Y in order to give the effect of a cadence. Although - the compositions areformally rondos, they resemble hymns. I have reproduced one as Figure XIII-2. As all hymns should have titles and words,I have provided theseby nonrandom means.The other compositionssound much like tire one given. Clearly, they are all by the samecomposer. Still, after a few hear- ings they can be recognizedas different. I have even managedto grow fond of them through hearing them too often. They must grate on the ears of an uncorruptedmusician. In 1951, David Slepian,an information theorist of whom we have heard before, took another tack. Following someearly work by Shannon,he evoked such statisticalknowledge of music as luy latent in the breastsof musically untrained *uthematicians who were near at hand. He showedsuch a subjecta quarter bar, a half bar, or three half bars of a "composition" and isked the subject to add a sensiblesucceeding half bar. He then showedutroihet subject an equal portion includirg that added half bar and got anotherhalf bar, and so on. He told the subjectsthe intendedstyles of the compositions. Information Theory and Art 257

[Fig. XIII-2 shows the four-part composition "RANDOM," with these words set to it:

When once my random thoughts I turned
On Bethlehem and on that star
Which drew three wise men from afar
I wondered what the wise men learned

That Christmas day, did they foresee
The Child grown Christ, and Christ denied
And Christ betrayed and crucified
And risen Christ the Deity?

Or did they learn of us who sing
The day when angels sang before
And, laying down the gifts they bore
Foretell our gifts and caroling?]

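A process along these general lines is easy to imitate. The sketch below is not the procedure Betty Shannon and I actually used (our catalog of chords and our specially made dice are not reproduced here); it is a minimal stand-in that picks triads on roots I through VI of C major, requires successive chords to share a common tone, forces chords 15 and 31 toward a IV or V root and chords 1, 16, and 32 to root I, and repeats measures 1 and 2 as measures 5 and 6. Voice-leading, which the original rules constrained, is ignored.

    import random

    random.seed(7)

    # Triads on roots I-VI of C major, as sets of pitch classes (C = 0 ... B = 11).
    TRIADS = {
        "I": {0, 4, 7}, "II": {2, 5, 9}, "III": {4, 7, 11},
        "IV": {5, 9, 0}, "V": {7, 11, 2}, "VI": {9, 0, 4},
    }

    def next_chord(prev):
        # The only rule of connection: the new chord shares a tone with the last.
        choices = [r for r, notes in TRIADS.items() if notes & TRIADS[prev]]
        return random.choice(choices)

    def compose():
        chords = ["I"]                         # chord 1 has root I
        for i in range(2, 33):                 # chords 2 .. 32
            if i in (15, 31):                  # approach the cadences on IV or V
                opts = [r for r in ("IV", "V") if TRIADS[r] & TRIADS[chords[-1]]]
                chords.append(random.choice(opts or ["V"]))
            elif i in (16, 32):                # cadence on the tonic
                chords.append("I")
            elif 17 <= i <= 24:                # measures 5 and 6 repeat measures 1 and 2
                chords.append(chords[i - 17])
            else:
                chords.append(next_chord(chords[-1]))
        return chords

    piece = compose()
    for measure in range(8):
        print(f"measure {measure + 1}:", " ".join(piece[measure * 4:(measure + 1) * 4]))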
In Figure XIII-3, I show two samples:a fragment of a chorale in which eachhalf bar was added on the basis of the preceding half bar and a fragrnent of a "romantic composition," in which each half bar was addedon the basisof the precedirg threehalf bars. It seemsto me surprising that these "compositions" hang togetheras well asthey do, despitethe inappropriate and inadmis- sible chords and chord sequenceswhich appear. The distinctness 258 Symbols, Signals and hloise CHORALE:

ROMANTIC:

Fig. XIII-3

of the stylesis also arresting;apparently the mathematicianshad quite different ideas of what was appropriate in a chorale and what was appropriatein a romantic Composition. Slepian'-s - experimentshows the remutkable flexibility of the human being as well as some of his fallibility. Tiue siochastic processesare apt to be more consistentbut duller.A numberhave been usedin the compositionof music. There is no doubt that a computer supplied with adequate statisticsdescribing the styleof a compor.r .orld produce,ur-do* music with a recognrzablesimilarity to a.o*poser's style.The nursery-tune -styledemonstrated by Pinkerton anOthe div-ersityof st-ylesevoked by Hiller and Isaacior, which I will describrpr.r- ently, illustrate this possibility. In 1956,Richard C. Pinkerton publishedin the ScienttfcAmeri- can somesimple schemes for writing tunes.He showedhow a note could be chosenon the basis of iti probability of following the particular preceditg note and how the probabiiitieschangeO-witfr respectto the position of the note in the bar. Using probabilities derived from nursery tunes, he computed the ent;by per note, Information Theory and Art 259 which he found to be 2.8bits. I feelsure that this is quite a bit too high, becauseonly digram probabilitieswere considered. He also presenteda simple finite-statemachine which could be used to generatebanal tunes,much as the machineof Figure III- 1 gener- ateso'sentences." In 1957,F. B. Brooks,Jr., A. L. Hopkins,Jr., P. G. NeumaruI, and W. V. Wright publishedan accountof the statisticalcomposi- tion of music on the basis of an extensivestatistical study of hymn tunes. In 1956,the BurroughsCorporation announced that they had useda computerto generatemusic, and, in 1957, it wasannounced that Dr. Martin Klein and Dr. Douglas Bolitho had used the Datatron computert

r Reproduced from L. A. Hiller, Jr., and L. M. Isaacson, Illiac Suite for String Quartet, New Music, 1957, by permission of Theodore Presser Company, Bryn Mawr, Pa. 260 Symbols, Signals and Noise (c)

Fig. XIII-4 of some simple pattern or repetition might have helped consider- ably. This could be of a strictly deterministicnature,L, in the case of the prescribedrepetitions in a rondo, or it could be of the nature of Chomsky's grammar,which we have consideredin ChapterVI. It is clear, however,that it is foolish to try to attatn long-range structure li*ply by relating a note to the immediately prJcediig notes by digram, trigram, and higher probabilities.Th; relation must be amongparts of the composition,not simply amongnotes. The work of Hiller and Isaacsondoes demonstrate conciusively that a computer can take over many musical choreswhich only human beings had been able to db before. A composer,and especially an unskilled composer,might very well rely on a com- puter for much routine musical drudgery. The composercould merely guide the main pattern of th; rb-position and let the computer fill in details of harmony and co.rnterpoint,according to a specificationof style or period. Further, the computercouli Information Theoryand Art 26r be usedto try out proposednew rules of composition,such asnew rules of counterpoint or harmotry, with whose use and conse- quencesthe composermight have little experienceand familiarity. In thesedays we hearthat cyberneticswill soongive us machines which learn. If they learn in a complicatedenough sense of the word, why couldn't they learn what we like, even when we don't know ourselves?Thus, by rewarding or punishing a computer for the successor failure of its efforts,we might so condition the com- puter that when we presseda button marked Spanish,classical, rock-and-ro11,sweet, etc., it would producejust what we wanted in connectionwith the terms.Such thoughts areintriguing, but they are of coursenonsense in our duy and will probably remain so for a long time to come. Music is not all of art. I began with music becauseit offers an apt meansfor illustratingin an unusualcontext someideas derived from information theory. We could just as well draw our illustra- tions from the use of lan guage.Indeed, experiments with the stochasticproduction of text have been perhaps more widely cultivated than experimentswith music. A professorat the Grand Academy of Lagoda showedCaptain Lemuel Gulliver a word frame consisting of lettered blocks mounted on shafts. The professor turned these at random and soughtnew wisdom in the patterns of letters which appeared. Here we seejust the wrong applicationof a stochasticprocess in the generation of text. Certainly, this will not give us new knowledge.Who would take the uncorroboratedword of a random process?There are all too many unsubstantiatedstatements avail- able; what we needto know is what is so and what isn't. Nonetheless,a stochasticprocess can produce someinteresting effects.In Chapter III we noted Shannon's approximations to English text. Thesewere made by using letter digram and trigram frequenciesand a tableof random numbers.We haveseen that they contain someinteresting "words." To me, deamyhas a pleasantsound; I would take ooit'sa deamy idea" in a complimentarysense. On the other hand, I'd hate to be denouncedas ilonesive.I would not like to be calledgrocid; per- haps it reminds me of gross,groceries, and gravid. Pondenome, whateverit may be, is at least dignified. 262 Symbols,Signals and Noise

I repeat Shannon's second-order word approximation here: The head and in frontal attack on an English writer that the character of this point is therefore another method for the letters that the time of who ever told the problem for an unexpected.

I find ttit disquieting. I feel that the English writer is in mortal peril' yet I cannot come to his aid becausethe latter part of the rhessageis garbled. In seekitg lessgarbled material, as I noted in ChapterVI, I wrote three grammatically connectedwords in a colu*r from the top down on a slip of p"aper.I showedthem to a friend, askedhim to make up a sentencein which they occurred,and then to add the next word in this sentence.I then folded over the top word of the four I now had and showedthe visible three to anotherfriend and got another word from him. After canvassingtwenty friends,I had the following:

When morning broke after an orgy of wild abandon he said her head shook vertically aligned in a sequence of words signifying wh at . . .

Later examplesare:

One duy when I went to what was Dionysus thinking of women without men go off half way decent impression . . . I forget whether he went on and on. Finally he stipulated that this must stop immediately after this. The last time I saw him when she lived. It happened one frosty look of trees waving gracefully against the wall. You never can . . .

We seethat a seemingorder of meaning persistsover groupsof far more than four words. Eventually, however,the text wanders. The long-rangewandering is of courseattributable to the fact that there is no long-rltge, persistentpurpose or meaningguiding the choiceof words. We sometimessee.a similar 'VV. quality"in the ,ittrr- ancesof schizophrenics. H. Hudson illustrates a more gradual wandgringadmirably by u characterin his ThePurple Land; Uncle Anselmo never gets to the end of a story becausehe continually wanderson to new ground. One can add a certain amount of long-rangeorder by writing, in view, at the bottom of the slip of paper to which peopleadi Information Theory and Art 263

words a title which indicateswhat the passageis supposedto be about. Dr. Donald A. Dunn of the StanfordElectronics Laboratory has kindly supplied me with some examples in which the person addinga word sawonly oneprecedirg word, togetherwith the title:

Men and Women Eve loved intensely sentimental or not sufficient tonight wherever you may die before yesterday again and whatever m'love misbehaves. The seduced are compatible unusual family life seemed wonderful experience for tendernessforever yours. Orphans frequently visited his promiscuity and infidelity despite hate and love for tomorrow sex ain't nothirg.

In the followirg examples,which were produced at the Bell Laboratories,the personadding a word saw threeprecedirg words as well as the title:

About Life Life has many good and wise men seldom condemn halfwits lightly! You wonder why not. Human feelings but savagetribes found . . .

Engineers It is frequently said that they knew why forces might affect salaries. However, all scientistscan't imagine . .

Housecleaning First empty the furniture of the master bedroom and bath. Toilets are to be washed after polishing doorknobs the rest of the room. Washing windows semiannually is to be taken by small aids such as husbandsare prone to omit soap powder.

Murder Story When I killed her I stabbed Claude between his powerful jaws clamped tightly together. Screaming loudly despite fatal consequences in the struggle for life ebbing as he coughed hollowly spitting blood from his ears.
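These parlor-game passages are, in effect, hand-made samples from a word source of limited memory. A mechanical analogue is sketched below: it tabulates which word follows each pair of successive words in some sample of text and then lets chance extend a passage, each new word chosen only on the basis of the two words just written, much as each contributor above saw only a few preceding words. The file name is merely a placeholder for whatever text one cares to feed it.

    import random
    from collections import defaultdict

    random.seed(3)

    # Any sizable sample of English text will do; the path is hypothetical.
    words = open("sample_text.txt").read().split()

    # Table of which words follow each pair of successive words.
    follows = defaultdict(list)
    for a, b, c in zip(words, words[1:], words[2:]):
        follows[(a, b)].append(c)

    # Start from a randomly chosen pair and let chance carry the passage along.
    pair = random.choice(list(follows))
    passage = list(pair)
    for _ in range(40):
        options = follows.get(pair)
        if not options:
            break
        nxt = random.choice(options)
        passage.append(nxt)
        pair = (pair[1], nxt)

    print(" ".join(passage))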

I think that it is hard to readsuch material without amusement. I feela little admirationas well. I would neverwrite, "It happened one frosty look of trees waving gracefully against the wall." I almost wish I could. Poor poetsendlessly rhyme love with dove, 264 Symbols, Signals and Noise

and they are constrainedby their highly trained mediocrity never to produce a good line. In some sense,fl stochastic process can do better; it at leasthas a chance.I wish t had hit on deamy, but I never would have. Will a compuJerproduce text of any literary merit by meansof grammatical rules and a sequenceof random numb.trt It might z'words" produce fresh and amusing and amusingshort passages of some shock value. One can of course imaline a machine designedto write detectivenovels and equippeOivith settingsfor hard-boiled, puzzle,char acter, suspense,u"o so or, but such a device seemsto me to be very far away. The visual arts can be used to illusirate the samepoints which have been made in connectionwith music and language.A com- pletely random visual pattern, like a completelytuidom acoustic wave or a compl:tely random sequenceof letters,is mathematically the most surprit_t.g,tl. least piedictable of all possiblepatt.t"i. Alas, a completelyrandom patfern is alsothe dullestof all patterns, and to a human -b.ti"gone random pattern looksjust like^another. Figure XIII-5, which is an arrayof i0,000 randornlyblack or white dots, illustrates this. Bela Julesz, who works in the field of perception,caused an

Fig. XIII-5

electronic computer to produce this random array of dots as a part of his studies of stereoscopic vision and of the meaning of pattern. He also programmed the computer to remove some of the randomness from such a random pattern. He did this by making the computer examine successively various sets of five points located at the tips and at the center of an X, as shown by the points marked X in Figure XIII-6 (other points are marked O). If the center point was the same (black or white) as either points 1 and 4 or points 2 and 3, it was changed (from black to white or from white to black). This tends to remove any black or white diagonals, except when points 1 and 4 are black and points 2 and 3 are white or vice versa.

As we can see from Figure XIII-7, making a pattern less random in this way alters and improves its appearance profoundly. An unpredictable (random) component is desirable for the sake of variety or surprise, but some orderliness is necessary if a pattern is to be pleasing.

This exploitation of both order and randomness is in fact old to art. The kaleidoscope offers a charming effect by giving to a random arrangement of bits of colored glass a sixfold symmetry.

[Fig. XIII-6: a field of points marked O, with the five points of the X (its four numbered tips and its center) marked X.]

Fig. XIII-7
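Julesz's smoothing rule can be restated in a few lines of code. The sketch below generates a small square array of random black (1) and white (0) cells and applies the rule described above: wherever the center of an X agrees with both points of either diagonal, the center is flipped. I have taken the tips of the X to be the four diagonal neighbours of the center; the exact spacing Julesz used, and whether his program updated the array in place, are assumptions here.

    import random

    random.seed(0)
    N = 20   # a small array; Julesz's figure used 10,000 dots

    # Random pattern: 1 = black, 0 = white.
    grid = [[random.randint(0, 1) for _ in range(N)] for _ in range(N)]

    def smooth(g):
        # If the center matches points 1 and 4 (one diagonal) or points 2 and 3
        # (the other diagonal), flip the center.
        out = [row[:] for row in g]
        for i in range(1, N - 1):
            for j in range(1, N - 1):
                c = g[i][j]
                p1, p4 = g[i - 1][j - 1], g[i + 1][j + 1]   # one diagonal
                p2, p3 = g[i - 1][j + 1], g[i + 1][j - 1]   # the other diagonal
                if (c == p1 == p4) or (c == p2 == p3):
                    out[i][j] = 1 - c
        return out

    smoothed = smooth(grid)
    for before, after in zip(grid, smoothed):
        print("".join(".#"[v] for v in before), "  ", "".join(".#"[v] for v in after))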

M?ty years ago Marcel Duchamp, who painted Nude Descending a Staircnse,allowed a number of threadsto fall on piecesof black cloth and then framed and preserved them. Jean-Tinguley, the Swissartist, has produced, by meansof a machine,partly ordered, partly random colored designsof considerablemeriti I derive continuing pleasurefrom one which hangsin my office.I saved for yearsa pile of solderdroppings which I intendedto mount on a block of ebony and present to the Museum of Modern Art. Finally, I lost both the solderand the desireto do so. All of this has given me a sort of minimum philosophyof art, which I will not, I hastento assurethe reader,6lame on informa- tion theory. It is a minimum philosophy becauseit saysnothing about the ialent or geniuswhiifr aloni iur make art worth while. Successfulart requiresthe appreciationof an audienceas well asthe talent of the artist. Audiencesare influenced by thingsother than the object of art beforethem. If a personsets his mind against it,.anything will leave him cold. A desireto appreciatecan, on the other hand, lead to one's liking even poor works. I like the hynn- like compositionsthat Betty Shannonand I made.Authors some- times prefer inferior works of their own. Both small coteriesand largegroups can be led to appreciatesincerely things which arefor Information Theory and Art 267 a time the fashion but which have little long-rangeappeal and which probablyhave little merit. Among other things,audiences want to have a senseof author- ship, a senseof an individual,in connectionwith works of art. To bring appreciationto an artist,his work must haveenough con- sistencyso that it is recognrzableas his. How let down the sincere appreciatormust be if he alwayshas to look at the label or wait for the announcerin order to know that the painting or music is the product of his favoriteartist. Supposethat one artist had actuallyproduced in successionthe masterpieceswe now acceptas the works of a number of great artistswith diversestyles, long beforethe artistslived. This would astonishuS, but we could scarcelyappreciate him as an artist, howevermuch we might admire the individual paintings.Picasso is eminentlyrecogntzable, but he is disquieting.He hasbeen skillful in many styles,and yet he escapesour finaljudgment by goingfrom one styleto another.How much easierit is to appreciateMatisse. To be appreciatedby an audience,art must be intelligibleto the audience.Even a goodjoke in Chinesewill amusefew Americans, and certainlyten jokes in Chinesewill be no more amusingthan one. To a degree,to be appreciatedart must be in a language familiar to the audience;otherwise no matter how greatthe variety may be, the audiencewill have an impressionof monotony,of sameness.We ca-nbe surprisedrepeatedly only by constrastwith that which is familiar, not by chaos. Someartists adopt a languagetaught to their audienceby earlier masters.Brahms was one of these.Other artiststeach somethirg of a new languageto their audiences,as the impressionistsdid. Certainly,the languageof art changeswith time, and we should be grateful to the artistswho teach us new words. However,we shouldnot doubt the originalityof suchartists as Bachand Handel, who spokeringingly in a languageof the past. While a languagewith intelligible words and relationsbetween wordsis necessaryin art, it is not sufficient.Mechanical sameness is dull and disappointing.I prefer the surprisesof stochasticprose to the vapidverses of OwenMeredith. 
Perhaps in some age of bad art, man will be forced to stochastic art as an alternative to the stale product of human artisans. So much for information theory and art.

CHAPTER XIV
Back to Communication Theory

SunEtY, IT IS woNDERFULif a new idea contributes to the solution of a broad range of problems. But, first of all, to be worthy to notice a new idea must have some solid and clearlyJ demonstratedvalue, howevernaffow that value may be. An information theorist has critrcrzedme for explorirg in this book possibleapplications of information theory in fieldi of lan- guage, psychology, and art. To him, the relation betweensuch subjectsand information theory seemsmarginal or evendubious. Why distract the reader from the clearly demonstratedvalue and i{portance of information theory by discussingmatters concerning which no clear value or importance can be demonstrated? Partly, in writing this book I have felt an obligation to thereader to discussrelations betweeninformation theory in its solid and narrow senseand various fields with which it has been connected in the writings of others.Partly, I believethat information theory is useful in helping us in talkitg senseor at least in keepingfrom talking nonsensein connectionwith some linguistic,ariistii, and p.sychologicalproblems. However, there is a dangerin overempha- sizing such matters in a book on information theory. It would certainlybe wrong to assertor to believethat informa- tion theory is valuablechiefly because of wide-rangingconnections 268 Back to Communication Theorv 269

with a variety of fields such as language, cybernetics, psychology, and aft. To believe this would be to repeat mistakes which have been made in connection with other important discoveries. Thus, in Newton's duy his work was beclouded by controversy and philosophy, and for many years thereafter it was associated in people's minds with a putative universality which confused its real nature. Einstein, however, could see more clearly. He said: "Reaso[, of course, is weak when measured against its never ending task." Einstein then described Newton's contribution to this task of understanding and observed, "and with that, the goal was reached, the scienceof celestial mechanics was born, confirmed a thousand times over by Newton himself and those who came after him." It is fair to add that since Newton's da1l, Newtonian mechanics has been useful in solvitg or contributing to the solution of prob- lems that never entered the minds of Newton and his contempo- raries, but it has not solved all problems of science, as some optimistic philosophers expected it to. To me the indubitably valuable content of information theory seems clear and simple. It embraces the ideas of the information rate or entropy of an ergodic message source, the information capacrty of noiselessand noisy channels, and the efficient encodirg of messagesproduced by the source, So as to approach errorless transmission at a rate approaching the channel capacity. The world of which information theory gives us an understanding of clear and present value is that of electrical communication systems and, especially, that of intelligently designing such systems. It seems to me wise at the close of this book to turn away from the broad, speculativepossibilities (or impossibilities?) of informa- tion theory and to ask the followirg question: Beyond the things already described in this book, what have information theorists done and what are they doing that is mathematically sound, well founded, compelling? What, in other words, have they done that qualifies as sound science which we must accept rather than as intriguitg speculation that we have the privilege of arguing about? Here we find a broad range of work. To explain all of it fully to the reader would take another book. Thus, this chapter will be a brief summary of some of the work of information theorists since 270 Symbols, Signals and Noi,ge

the publication of Shannon'soriginal paper. Its purpose is to acquaint the reader with the scope of information theory in its narrow senseand, perhaps, to entice him into followirg such activities in greaterdetail. One thing that information theoristshave soughtis someappli- cation of the entropy of information rate of a messagesource to a problem other than that of encoding and transmissionof informa- tion. Ambitious men want to bring meaning into the picture somehow, but a more modest worker is willirg to settlefor any application which is meaningful and rigorouslycorrect. The only application of information rate to a problem other than efficient encoding which has been given so far and which meetsthese criteria was advancedby J. L. Kelly, Jr., in 1956.rIt concernsgambling on chanceevents in which the bettor hasinside information as to the outcome of the event bet upon. We might imagine, for instance,that the dice are alreadythrown (or the race run) and that the favoredbettor knows this and hasreceived some knowledge of the outcome, but the person with whom he bets doesn't know this and givesthe bettor fair odds on the basisof the chanceof the outcome. The information which the bettor receivesis doled out to him in bits, that is, yes-or-no answers to questiols. His informant could, for instance, inform the bettor completely conceroirg whether a coin tossedhad turned up headsor tails by sendinghim one bit of information. Or the informant could narrow foi the bettor the possibleoutcomes of the cast of a die from 6 to 3 by using one bit of information to tell the bettor whetherthe outcome was odd or even. Followitg this introduction, I can best explain Kelly's resultby quoting the abstractof his paper:

If the input symbols to a communication channel represent the outcomes of a chance event on which bets are available at odds consistent with their probabilities (i.e., "fair" odds), a gambler can use the knowledg. given him by the received symbols to cause his money to grow exponentially. The maximum exponential rate of growth of the gambler's capital is equal to

1"New Interpretation of Information Rate," Bell SystemTechnical Journal Vol. 35 (July, 1956),pp. 917-926. Back to Communication Theory 271

the rate of transmission of information over the channel. This result is generalized to include the case of arbitrary odds. Thus we find a situation in which the transmission rate is significant even though no coding is contemplated. Previously this quantity was given significance only by a theorem of Shannon's which asserted that, with suitable encoding, binary digits could be transmiited over the channel at this rate with an arbitrarily small probability of error.

Numerically, the factor by which the gambler's initial capital is increased after N bets is

2^(NR)

Here R is the average number of bits of information transmitted to the bettor per bet.
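Kelly's result can be checked by simulation. In the sketch below, which is an illustration and not Kelly's own example, the outcome of a fair coin is reported to the bettor through a noisy channel that lies with probability p; the bettor gets even-money odds and wagers the Kelly fraction of his capital on the side reported to him. Over many bets his capital grows by very nearly the factor 2^(NR), where R = 1 - H(p) is the rate at which the channel delivers information about each outcome.

    import random
    from math import log2

    random.seed(11)
    p = 0.1                               # probability that the tip received is wrong
    q = 1.0 - p
    R = 1.0 + q * log2(q) + p * log2(p)   # information per bet: 1 - H(p) bits
    f = q - p                             # Kelly fraction of capital to wager each time

    N = 10_000
    capital = 1.0
    for _ in range(N):
        outcome = random.random() < 0.5                  # the fair coin bet upon
        tip = outcome if random.random() > p else not outcome
        stake = f * capital
        # Even-money ("fair") odds: win the stake if the tip was right, lose it if not.
        capital += stake if tip == outcome else -stake

    print(f"R = {R:.4f} bits per bet")
    print(f"observed growth: 2^({log2(capital) / N:.4f} * N)")
    print(f"Kelly's formula: 2^({R:.4f} * N)")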

If this seems a trivial application of the amount of information in bits, the reader should meditate on the fact that it is the only mathematically established interpretation, other than those concerned with the rate of generation of probable messages and their efficient encoding for transmission, that anyone has discovered.

In advancing information theory, one may seek a new use for information theory rather than a new interpretation of information rate. Thus, in 1949, C. E. Shannon published a long paper entitled "Communication Theory of Secrecy Systems."² It is doubtful whether this paper has helped substantially in the deciphering of messages, but it has provided, for the first time, a well organized theory of cryptography and secrecy, and it is highly regarded by the expert cryptanalysts.

It would be hopeless to try to go into the details of Shannon's work here, but I will try to give an idea of some of its content.

The cryptanalyst who lays hands on a message enciphered by an unknown means is ignorant of two things: the message itself and a specification of the means used to encipher it, which we may call the key. Sometimes, the cryptanalyst may know the general scheme of encipherment. To take a ridiculously simple example, he might know that a simple substitution cipher had been used, that is, for each letter of the alphabet some other letter had been substituted according to a fixed scheme.

² Ibid., Vol. 28 (October, 1949), pp. 656-715.

The cryptanalyst may have a short or a long enciphered message to work with. If the message had only three letters in it, say QXD, these might stand for AND, or BET, or any other English word made up of three different letters. As the message becomes longer, however, the number of possible English texts which could have been encrypted by means of a simple substitution cipher to give the particular message at hand decreases; if the enciphered message is long enough, there will be only one possible source message.

Shannon expressed this decrease of uncertainty as to what message might have been enciphered so as to give the message in question as a change in the equivocation. The equivocation Hy(x) of Chapter VIII gives the uncertainty of what message was enciphered by the general means in question in order to give the received enciphered message. Shannon was able to compute in the case of various ciphers how the equivocation decreases as the number of characters in the message increases. When the equivocation approaches zero, only one message could have been enciphered to give the enciphered message, and, in principle, the message can be deciphered uniquely.

What other sorts of problems have confronted or now confront information theorists? Some of these problems concern the sampling theorem. Information theorists use the sampling theorem in order to represent a smoothly varying, band-limited signal by means of a sequence of numbers; the sample numbers are the amplitudes of the signal taken every 1/2W seconds, where W is the band width of the signal.

The samples which represent a given band-limited signal are not unique; they can be taken at various times. Thus, in Figure XIV-1, either the vertical solid lines or the vertical dashed lines are samples which legitimately represent the function, and samples could have been taken at many other locations. In fact, the samples don't even have to be equally spaced in time, provided that, on the average, there are 2W samples per second!
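The statement that samples taken every 1/2W seconds suffice can be made concrete with a little arithmetic. The sketch below is only an illustration, not part of the original discussion: it builds a signal containing no frequencies above W, samples it at the rate 2W, and then reconstructs the value at an arbitrary intermediate time from the samples alone, using the interpolation sum x(t) = sum over n of x(n/2W) sinc(2Wt - n).

    import numpy as np

    W = 4.0                        # bandwidth in hertz; the signal has no energy above W
    T = 1.0 / (2.0 * W)            # sampling interval: one sample every 1/2W seconds

    def signal(t):
        # A band-limited test signal: both components lie below W.
        return np.sin(2 * np.pi * 1.3 * t) + 0.5 * np.cos(2 * np.pi * 3.1 * t)

    # Samples over a long stretch of time stand in for the doubly infinite set.
    n = np.arange(-2000, 2001)
    samples = signal(n * T)

    def reconstruct(t):
        # Whittaker-Shannon interpolation: x(t) = sum_n x(nT) * sinc(t/T - n)
        return np.sum(samples * np.sinc(t / T - n))

    for t in [0.013, 0.21, 0.37]:
        print(f"t = {t:5.3f}   true {signal(t): .6f}   from samples {reconstruct(t): .6f}")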

Fig. XIV-I Back to Communication Theorv 273 A band-limitedsignal is representeduniquely by 2W samples per secondonly when all samplesfrom the infinite past to the infinite future are used.Sometimes we would like to talk about a piece of band-limited signal or about a band-limited signalwhich is almost zero except for some specifiedrange of time, and we would like to describesuch a portion of a signal or a signalof limited duration handily in terms of samples. Our first thought might be, can we merely specifya short signal or a portion of a signalby specifyirg the valuesof a finite sequence of samplesand sayingnothing about samplesbefore or after these? Alas, specifyitg such a finite set of samplesdoes not specifyjust one band-limited signal; many differentband-limited signalscan be passedthrough a finite sequenceof samples,and, if the signals are very large outsideof the range of the specifiedsamples, they can be very differentwithin the rangeof the specifiedsamples. This failing, we might sa), let us specify certain successive samplevalues and make all precedingand succeedingsamples be zero. Surely, we may think, the band-limited signal so specified will conform closelyto the samplevalues where theseare not zero and will be small whereverthe samplesare specifiedas zero. Suppose,for instance,that we insist that all of a set of equally spacedsamples after a tim a to are zero,while the samplesbefore the time ts are nonzero,as shown by the dots in Figure XIV-2. Becausethe samplesare specifiedfor all times past and future, they do specifya unique band-limited signal.Will this signal be nearly zero for times after ts? Alas, H. O. Pollak,of the Bell Laboratories,has shown that this neednot be so. Supposewe ask,what part of the total energyof

------{*f-*€€+-

I I

t -t> to Fig. XIV.2 214 Symbols, Signals and lt{oise the band-limited signalpassing through such samplesis carried by the part of the wave which occurs ten seconds,or twenty minutes, or fifty yearsafter /s? Rememberall the samplesare zero after /e. The surprisinganswer is that almosthalf of the energyof the signal can be carriedby the part that occurslater than any specified time after the samplesbecome zero.Thus, the signal can be zerc at all the samplesaftet to and still be large in betweenthem. Efforts to use the samplirg theorem rigorously to represent signals of limited length are in mathematicaltrouble, and mathe- maticians are trying to find some way out. Work by Pollak and Slepianindicates that neither samplesnor sine wavesare the most appropriateway to representband-limited functions of finite duration, and thesemathematicians have used a more appropriate group of functions called prolate spheroidal functions for this purpose. One puzzltngmatter about information theory may be illustrated by the followirg example. Supposethat in telegraphy we let a positive pulserepresent a dot and a negativepulse represent a dash. Suppose that some practical joker reversesconnections so that when a positive pulseis transmitteda negativepulse is received and when a negative pulse is transmitted a positive pulse is received.Because no uncertaintyhas been introduced, information theory says that the rate of transmissionof information is just the same as before.Yet we feel that somedamage has been doneto the communication system. The damage would be even more appalling if, in a teletypewriterlink, we consistentlyprinted out W for A, K for B, and so on, in a completelyscrambled fashion. This bothered Shannor, and he has worked out a theory to cover the situation.In this theory,he establishesa fidelity criterion. Thus, he might assigna givenpenalty for substitutinga consonant for a vowel and a lesserpenalty for substituting one vowel for another. He can then assessthe damagedone to a messageby either consistent or random errors, When the damage is done by the random errors of a noisy channel,he showsin principle how to minimize rt, and he showshow many bits per secondare required to transmit the signal with a given degreeof fidelity. Shannonhas also done a considerableamount of work concern- Back to Communication Theorv 275 itg the transmission of messagesover networks in which one mes- slge may interfere with another message. The simplest case is that of transmission of messagesin both directions over the same channel between two points, ,4 and .8. As a very special case,we will assume that the circuit acts the same from B to A as from AtoB. Suppose that we plot the channel capacity for transmission from A to B against the channel capacity for transmission from B to A, as shown in Figure XIV-3. We can imagine two very simple cases. In one case, transmission from B to A does not interfere with transmission from A to B, and transmission from A to B does not interfere with transmission from B to A. In this case, the curye consists of the horizontal solid line giving the channel cap actiy from B to A and the vertical solid line giving the channel cap acity from ,4 to B. Or we can imagine that at one time we can transmit in one

Fig. XIV-3 (bits per second, B to A, plotted horizontally against bits per second, A to B, plotted vertically; curves labeled NO INTERFERENCE, INTERMEDIATE CASE, and COMPLETE INTERFERENCE)

Or we can imagine that at one time we can transmit in one direction only, either from A to B or from B to A. Then if we are transmitting from A to B one-third of the time, we can transmit from B to A only two-thirds of the time, and so on. The sum of the channel capacity from B to A and the channel capacity from A to B must be a constant, and the result is the dashed 45° line of Figure XIV-3.

In an intermediate case, in which there is some interference between transmission in the two directions, we will get a curve roughly of the form of the dotted line of Figure XIV-3.

The study of efficient encoding continues to command the attention of information theorists. In the case of discrete channels, information theorists continually find better codes for correcting A errors in a sequence of B digits.
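The sort of code referred to here can be made concrete with a small sketch. What follows is the classic Hamming (7,4) single-error-correcting code, in which three check digits are added to four message digits; it is offered only as an illustration of the principle, not as the particular codes information theorists now favor.

```python
# A sketch of the Hamming (7,4) code: 4 message digits plus 3 check digits
# give a 7-digit block in which any single error can be corrected.

def encode(m):                      # m: list of 4 binary digits, e.g. [1, 0, 1, 1]
    d1, d2, d3, d4 = m
    p1 = d1 ^ d2 ^ d4               # the three check digits are chosen so that
    p2 = d1 ^ d3 ^ d4               # three parity sums over fixed positions
    p3 = d2 ^ d3 ^ d4               # come out even
    return [p1, p2, d1, p3, d2, d3, d4]   # positions 1..7; checks at 1, 2, 4

def decode(r):                      # r: received 7-digit block, possibly with one error
    s1 = r[0] ^ r[2] ^ r[4] ^ r[6]  # recompute the three parity sums
    s2 = r[1] ^ r[2] ^ r[5] ^ r[6]
    s3 = r[3] ^ r[4] ^ r[5] ^ r[6]
    syndrome = s1 + 2 * s2 + 4 * s3 # 0 means no error; otherwise it names the bad position
    if syndrome:
        r = r.copy()
        r[syndrome - 1] ^= 1        # flip the erroneous digit
    return [r[2], r[4], r[5], r[6]] # recover the 4 message digits

block = encode([1, 0, 1, 1])
block[4] ^= 1                       # the channel flips one digit
print(decode(block))                # [1, 0, 1, 1]: the original message
```

The check digits are chosen so that any single error announces its own position; codes which correct several errors in longer blocks work on the same principle but are considerably more elaborate.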

Information theorists also seek best codes for transmitting information over a noisy continuous channel. In 1959, Shannon published a long paper in which he arrived at upper and lower bounds on the attainable error rates for codes of various complexity (that is, length) used in signaling over a continuous channel with Gaussian noise. Currently, convolutional codes and Viterbi decoding are favored, and ingenious men vie in making better and cheaper decoders.

Further, engineers who wish to improve electrical communication continually try to find new encoding and transmission schemes which are simple enough to be useful. They try to encode television and voice signals into as few binary digits per second as they can; the approaches they use have been indicated in Chapter VII. Such efficient encoding is growing in importance because digital transmission of signals (as in pulse code modulation) is displacing analog communication. It will grow in importance as the encrypting of signals in order to obtain privacy or secrecy becomes more common, for secrecy is best attained by digital means.

Engineers also look for simple and efficient error-correcting means useful in correcting the multiple errors which occur in the transmission of digital signals over existing telephone circuits. The use of digital transmission in transmitting text and in transmitting business and technical data is growing by leaps and bounds, both in military and in civilian applications. Telephone circuits go almost everywhere. To keep pace with data, we must use voice circuits for data. Here error correction by error detection and retransmission is the favored technique. But error-correcting codes have their place.

There are special circumstances that call for special means of modulation. Mobile radio is one. In a city, signals reach the vehicle from many directions after bouncing off many buildings, and a short pulse would be received as a smear of pulses which have traveled different distances over different paths. Here great ingenuity is needed in finding the best way to use a large overall bandwidth in improving transmission. Military communication, especially in the face of jamming, poses a host of problems.

Perhaps some would regard all this as engineering drudgery, unexciting compared with the broad philosophical vistas which information theory seems to open to us. Can an informed understanding, a loving appreciation of the nature, virtues and distinctions among the French Impressionists or the Dutch genre painters ever be so meaningful as a sudden and bewildering confrontation with a new and strange world of art, such as the Japanese? Yet, the connoisseur who pursues with devotion the details of a field may well have as much insight and as sound values as the rapturous dilettante. There is some intellectual obligation to appreciate a field for what it is rather than for the reactions it excites in the minds of the uninformed.

I hope that this book has its exciting aspects, but I also hope that it won't lead the reader to a view of information theory widely different from that held by informed workers in the field. Hence, it is perhaps well to end in a vein.

APPENDIX: On Mathematical Notation

THE READER WILL FIND a fairly liberal use of mathematical notation in this book, including a number of equations. This may incline him to say the book is full of mathematics.

Of course it is. Communication theory is a mathematical theory, and, as this book is an exposition of communication theory, it is bound to contain mathematics. The reader should not, however, confuse the mathematics with the notation used. The book could contain just as much mathematics and not include one symbol or equality sign. The Babylonians and the Indians managed quite a lot of mathematics, including parts of algebra, without the aid of anything more than words and sentences. Mathematical notation came much later.

Its purpose is to make mathematics easier, and it does for anyone who becomes familiar with it. It replaces long strings of words which would have to be used over and over again with simple signs. It provides convenient names for quantities that we talk about. It presents relations concisely and graphically to the eye, so that one can see at a glance the relations among quantities which would otherwise be strewn through sentences that the eye would be perplexed to comprehend as a whole.

The use of mathematical notation merely expresses or represents mathematics, just as letters represent words or notes represent music.

Mathematical notation can represent nonsense or nothing, just as jumbled letters or jumbled notes can represent nothing. Crackpots often write tracts full of mathematical notation which stands for no mathematics at all.

In this book I have tried to put all the important ideas into words in sentences. But, because it is simpler and easier to understand things written concisely in mathematical notation, I have in most cases put statements into mathematical notation also. I have to a degree explained this throughout the book, but here I summarize and enlarge on these explanations. I have also ventured to include a few simple related matters which are not used elsewhere in this book, in the hope that these may be of some general use or interest to the reader.

The first thing to be noted is that letters can stand for numbers and for other things as well. Thus, in Chapter V, Bⱼ stands for a group or sequence of symbols or characters, a group of letters perhaps; j signifies which group. For the first group of letters, j might be 1, and that first group might be AAA, for instance. For another value of j, say, 121, the group of letters might be ZQE.

We often have occasion to add, subtract, multiply, or divide numbers. Sometimes we represent the numbers by letters. Examples of the notations for these operations are:

Addition

2 + 3     a + d

We read a + d as "a plus d." We may interpret a + d as the sum of the number represented by a and the number represented by d.

Subtraction

5 − 4     q − r

We read q − r as "q minus r."

Multiplication

3 × 5 or 3·5 or (3)(5)     u × v or u·v or uv

If we did not use parentheses to separate 3 and 5 in (3)(5), we would interpret the two digits as 35 (thirty-five).

We can use parentheses to distinguish any quantities we want to multiply. We could write uv as (u)(v), but we don't need to. We read (3)(5) as "3 times 5," but we read uv as "uv," with no pause between the u and the v, rather than as "u times v."

Division

6 ÷ 3     or     6/3

1 ÷ p     or     1/p

We ordinarily read 1/p as "1 over p" rather than as "1 divided by p." Quantities included in parentheses are treated as one number; thus

(2 + 4) ÷ 3 = 6 ÷ 3 = 2
(4 + 8)(2) = (12)(2) = 24
(a + b)c = ac + bc

We read (a + b) either as "a plus b" or as "the quantity a plus b," if just saying "a plus b" might lead to confusion. Thus, if we said "c times a plus b" we might mean ca + b, though we would read ca + b as "ca plus b." If we say "c times the quantity a plus b," it is clear that we mean c(a + b).

The idea of a probability is used frequently in this book. We might say, for instance, that in a string of symbols the probability of the jth symbol is p(j). We read this "p of j." The symbols might be words, numbers, or letters. We can imagine that the symbols are tabulated; various values of j can be taken as various numbers which refer to the symbols. Table XVI shows one way in which the numbers j can be assigned to the letters of the alphabet.

TABLE XVI

Value of j     Corresponding Letter
1              E
2              T
3              A
4              O
5              N
6              R
etc.

When we wish to refer to the probability of a particular letter, N for instance, we could, I suppose, refer to this as p(5), since 5 refers to N in the above table. We'd ordinarily simply write p(N), however.

What is this probability? It is the fraction of the number of letters in a long passage which are the letter in question. Thus, out of a million letters, close to 130,000 will be E's, so

p(E) = 130,000/1,000,000 = .13
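The definition translates directly into a small computation. The passage below is made up merely to show the procedure; a genuinely long stretch of English text is needed for the fraction to settle near the tabulated values.

```python
# Estimating p(E) as the fraction of letters in a passage that are E.
passage = """it is the fraction of the number of letters in a long
passage which are the letter in question"""

letters = [ch for ch in passage.upper() if ch.isalpha()]
p_E = letters.count("E") / len(letters)
print(round(p_E, 3))   # for truly long English text this comes out near .13
```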

Sometimes we speak of probabilities of two things occurring together, either in sequence or simultaneously. For instance, x may stand for the letter we send and y for the letter we receive. p(x, y) is the probability of sending x and receiving y. We read this "p of x, y" (we represent the comma by a pause). For instance, we might send the particular letter W and receive the particular letter B. The probability of this particular event would be written p(W, B). Other particular examples of p(x, y) are p(A, A), p(Q, S), p(E, E), etc. p(x, y) stands for all such instances.

We also have conditional probabilities. For instance, if I transmit x, what is the probability of receiving y? We write this conditional probability pₓ(y). We read this "p sub x of y." Many authors write such a conditional probability p(y | x), which can be read as "the probability of y given x." I have used the same notation which Shannon used in his original paper on communication theory.

Let us now write down a simple mathematical relation and interpret it:

p(x, y) = p(x) pₓ(y)

That is, the probability of encountering x and y together is the probability of encountering x times the probability of encountering y when we do encounter x. Or it may seem clearer to say that the number of times we find x and y together must be the number of times we find x times the fraction of times that y, rather than some other letter, is associated with x.
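The relation can be tried on a toy example. The numbers below are invented for illustration (a source that sends only A and B through an imperfect channel); they are not taken from any table in this book.

```python
# An illustrative check of p(x, y) = p(x) * p_x(y), with made-up numbers:
# a source that sends A half the time and B half the time, over a channel
# that delivers the sent letter correctly 9 times out of 10.
p_send = {"A": 0.5, "B": 0.5}                     # p(x)
p_cond = {("A", "A"): 0.9, ("A", "B"): 0.1,       # p_x(y), read "p sub x of y"
          ("B", "A"): 0.1, ("B", "B"): 0.9}

# Build the joint probabilities from the relation in the text.
p_joint = {(x, y): p_send[x] * p_cond[(x, y)] for (x, y) in p_cond}

print(p_joint[("A", "B")])                        # 0.05: send A and receive B
print(sum(p_joint.values()))                      # 1.0: some pair always occurs
```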

We frequently want to add many things up; we represent this by means of the summation sign Σ, which is the Greek letter sigma. Suppose that j stands for an integer, so that j may be 0, 1, 2, 3, 4, 5, etc. Suppose we want to represent

0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8

which of course is equal to 36. We write this

  j=8
   Σ   j
  j=0

We read this, "the sum of j from j equals 0 to j equals 8." The Σ sign means sum. The j = 0 at the bottom means to start with 0, and the j = 8 at the top means to stop with 8. The j to the right of the Σ sign means that what we are summing is just the integers themselves.

We might have a number of quantities for which j merely acts as a label. These might be the probabilities of various letters, for instance, according to Table XVII. If we wanted to sum these probabilities for all letters of the alphabet we would write

  26
   Σ   p(j)
  j=1

We read this "the sum of pU ) fromT equalsI to 26." This quantity is of courseequal to 1. The fraction of times A occursper letter plus the fractironof times B occurs per ietter, and so or, is the iraction of times per letter that any letter at aLloccurs, and one letter occursper letter. If we just write p(j ) j On Mathematical Notation 283 TnnrE XVII

TABLE XVII

Value of j     Letter Referred to     Probability of Letter, p(j)
1              E                      .13105
2              T                      .10468
3              A                      .08151
4              O                      .07995
5              N                      .07098
6              R                      .06882
7              I                      .06345
8              S                      .06101
9              H                      .05259
10             D                      .03788
11             L                      .03389
12             F                      .02924
13             C                      .02758
14             M                      .02536
15             U                      .02459
16             G                      .01994
17             Y                      .01982
18             P                      .01982
19             W                      .01539
20             B                      .01440
21             V                      .00919
22             K                      .00420
23             X                      .00166
24             J                      .00132
25             Q                      .00121
26             Z                      .00077

Sometimes we have an expression involving two letters, such as i and j. We may want to sum with respect to one of these indices. For instance, p(i, j) might be the probability of letter i occurring followed by letter j, as p(Q, V) would be the probability of encountering the sequence QV. We could write, for instance,

   Σ   p(i, j)
   j

We read this, "the sum of p of i, j with respect to (or, over) j." This says, let j assume every possible value and add the probabilities. We note that

   Σ   p(i, j) = p(i)
   j

This reads, "the sum of p of i, j over j equals p of i." If we add up the probabilities of a letter followed by every possible letter we get just the probability of the letter, since every time the letter occurs it is followed by some letter.
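Here, too, a toy example may help. The digram probabilities below are invented for a three-letter alphabet; summing them over the second letter recovers the probability of the first.

```python
# Made-up digram probabilities p(i, j) for a three-letter alphabet,
# illustrating that summing over j gives the single-letter probability p(i).
p_digram = {
    ("Q", "U"): 0.009, ("Q", "A"): 0.001, ("Q", "E"): 0.000,
    ("U", "U"): 0.005, ("U", "A"): 0.030, ("U", "E"): 0.055,
    ("A", "U"): 0.076, ("A", "A"): 0.014, ("A", "E"): 0.810,
}

def p(i):
    # the sum of p(i, j) over j equals p(i)
    return sum(q for (first, _), q in p_digram.items() if first == i)

print(p("Q"))   # 0.01: Q is followed by *something* every time it occurs
```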

Besides addition, subtraction, multiplication, and division we also want to represent a number or quantity multiplied by itself some number of times. We do this by writing the number of times the quantity is to be multiplied by itself above and to the right of the quantity; this number is called an exponent.

2^1 = 2    "2 to the first (or 2 to the first power) equals 2." 1 is the exponent.
2^2 = 4    "2 squared (or 2 to the second) equals 4." 2 is the exponent.
2^3 = 8    "2 cubed (or 2 to the third) equals 8." 3 is the exponent.
2^4 = 16   "2 to the fourth equals sixteen." 4 is the exponent.

We can let the exponent be a letter, n; thus, 2^n, which we read "2 to the n," means multiply 2 by itself n times. a^n, which we read "a to the n," means multiply a by itself n times. To get consistent mathematical results we must say

a^0 = 1

"a to the zero equals 1," regardless of what number a may be.

Mathematics also allows fractional and negative exponents. We should particularly note that

a^(-n) = 1/a^n

We read a^(-n) as "a to the minus n." We read 1/a^n as "one over a to the n." It is also worth noting that

a^n a^m = a^(n+m)

"a to the n, a to the m equals a to the n plus m." Thus

2^3 × 2^2 = 8 × 4 = 32 = 2^5

or

4^(1/2) × 4^(1/2) = 4^1 = 4.

A quantity raised to the 1/2 power is the square root:

4^(1/2) = the square root of 4 = 2.

It is convenient to represent large numbers by means of the powers of 10 or some other number:

3.5 × 10^6 = 3,500,000

This is read "three point five times ten to the sixth (or ten to the six)."
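These rules can be tried out directly. In the short sketch below, Python's ** operator stands for raising to a power; the particular numbers are arbitrary.

```python
# Exponent rules from the text, checked numerically.
print(2**3 * 2**2 == 2**5)      # True: a^n a^m = a^(n+m)
print(4**0.5)                   # 2.0: a quantity raised to the 1/2 power is the square root
print(2**0, (-7.3)**0)          # 1 1.0: a^0 = 1 regardless of what number a may be
print(2**-3, 1 / 2**3)          # 0.125 0.125: a^(-n) = 1/a^n
print(3.5e6 == 3.5 * 10**6)     # True: 3.5 × 10^6 = 3,500,000
```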

The only other mathematical function which is referred to extensively in this book is the logarithm. Logarithms can have different bases. Except in instances specifically noted in Chapter X, all the logarithms in this book have the base 2. The logarithm to the base 2 of a number is the power to which 2 must be raised to equal the number. The logarithm of any number x is written log x and read "log x." Thus, the definition of the logarithm to the base 2, as given above, is expressed mathematically by:

2^(log x) = x

That is, "2 to the log x equals x." As an example,

log 8 = 3     2^3 = 8

Other logarithms to the base 2 are

x      log x
1      0
2      1
4      2
8      3
16     4
32     5
64     6

Some important properties of logarithms should be noted:

log ab = log a + log b
log a/b = log a − log b
log a^c = c log a

As a special case of the last relation,

log 2^m = m log 2 = m

Except in information theory, logarithms to the base 2 are not used. More commonly, logarithms to the base 10 or the base e (e = 2.718 approximately) are used. Let us for the moment write the logarithm of x to the base 2 as log₂ x, the logarithm to the base 10 as log₁₀ x, and the logarithm to the base e as logₑ x. It is useful to note that

log₂ x = (log₂ 10)(log₁₀ x) = 3.32 log₁₀ x

log₂ x = (log₂ e)(logₑ x) = 1.44 logₑ x
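The conversion factors are easy to confirm with Python's math module; the value of x below is arbitrary.

```python
import math

x = 20.0
print(math.log2(x))                   # the logarithm of x to the base 2
print(3.32 * math.log10(x))           # nearly the same: log2 x = 3.32 log10 x
print(1.44 * math.log(x))             # nearly the same: log2 x = 1.44 loge x
print(2 ** math.log2(x))              # 20.0: "2 to the log x equals x"
print(math.log2(8), math.log2(64))    # 3.0 6.0, as in the small table above
```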

The logarithm to the base e is called the natural logarithm. It has a number of simple and important mathematical properties. For instance, if x is much smaller than 1, then approximately

logₑ(1 + x) = x

Use is made of this approximation in Chapter X. In the text of the book, by log x we always mean log₂ x.

Glossary

ADDREss:In a computer, a number designating a part of the memory used to store a number, also the part of the memory which is used to store a number. ALrHABET:The alphabet, the alphabet plus the space, any given set of symbols or signals from which messagesare constructed. AMpLrruDE: Magnitude, intensity, height. The amplitude of a sine wave is its greatest departure from zero, its greatest height above or below zero. ATTENUATToN:Decrease in the amplitude of a sine wave during trans- mission. AUroMAroN: A complicated and ingenious machine. Elaborate clocks which parade figures on the hour, automatic telephone switching systems, and electronic computers are automata. AXIS: One of a number of mutually perpendicular lines which constitute a coordinate system. BAND: A range or strip of frcquencies. BANDLIMITED: Having no frequencies lying outside of a certain band of frequencies. BANDwIDTH: The width of a band of frequencies, measured in cps. BINARvDIGIr: A 0 or a 1.0 and I are the binary digits. BIr: The choice between two equally probable possibilities. BLocK: A sequenceof symbols, such as letters or digits. BLocK ENCoDING:Encoding a messagefor transmission, not letter by letter or digit by digit, but, rather, encoding a sequence of symbols together. BorrzuANN's coNSTANT: A constant important in radiation and other thermal phenomena. Boltzmann's constant is designated by the letter k. k - 1.37 X l0-23 joules per degree Celsius (centigrade).


BnowNIAN MorIoN: Erratic motion of very small particles causedby the impacts of the molecules of a liquid or gas. cAPAcIToR: An electrical device or circuit element which is made up of two metal sheets, usually of thin metal foil, separated by u ihirt dielectric (insulating) layer. A capacitor stores electric chaige. CAPACITY:The capacity of a communication channel is equaf to the number of bits per second which can be transmitted bvJ meansof the channel. CHANNELvoCoDER: A vocoder in which the speechis analyzed by measur- ing its energy in a number of fixed frequency ranges or bands. cHEcK DIGITS:Symbols sent in addition to the number of symbols in the original message,in order to make it possible to detect the presence of or correct errors in transmission. CLASSTCAL:Prequantum or prerelativistic. COMMAND:One of a number of elementary operations a computer can carry out, a.9., add, multiply, print out, and so on. CoMPLICATEDMACHINT: An automaton. coNrAcr: A piece of metal which can be brought into contact with another piece of metal (another contact) in order to close an electric circuit. cooRDINATE:A distance of a point in a space from the origin in a direction parallel to an axis. In three-dimensional space,how far up or down, east or west, north or south a point is from a specified origin. coRE (magnetic): A closed loop of magnetic material linked by wires. Cores are used in the memory of an electronic computer. Magneti- zation one way around the core means l; magnetiiation the other way around the core means 0. cPS: Cycles per second, the terms in which frequency is measured. CYCLE:A complete variation of a sine wave, from maximum, to minimum. to maximum again. DELAY: The difference between the time a signal is received and the time it was sent. DETECTION THEORY:Theory concerning when the presence of a signal can be determined even though the signal is mixed with a Jpecified amount of noise. DIGRAM PROBABILITY:The probability that a particular letter will follow another particular letter. DIMENSION: The number of numbers or coordinates necessaryto specify the position in a space is the number of dimensions ln ttre rpur.. The space of experience has three dimensions: up-dowr, east-west, north-south. DIODE: A device which will conduct electricity in one direction but not in the other direction. Glossary 289

DTscRETEsouRcE: A messagesource which produces a sequenceof symbols such as letters or digits, rather than an electric signal which may have any value at a given time. DrsroRrroNlEss: Tiansmission is distortionless if the attenuation is the same for sine waves of all frequencies and if the delay is the same for sine waves of all frequencies. DouBLE-cuRRENrTELEGRArHv: Telegraphy in which use is made of three distinct conditions: no current, current flowing into the wire, and current flowing out of the wire. ELEcTRoMAGNETTcwAvE: A wave made up of changing electric and mag- netic fields. Light and radio waves are electromagnetic waves. ENERGvLEvEL: According to quantum mechanics, a particle (atom, electron) cannot have any energy, but only one of many particular energies. A particle is in a particular energy level when it has the energy and motion characteristic of that energy level. ENsEMBTB:All of an infinite number of things taken together, such as, all the messagesthat a given messagesource can produce. ENTRopy: The entropy of communication theory, measured in bits per symbol or bits per second,is equal to the average number of binary digits per symbol or per second which are needed in order to transmit messages produced by the source. In communication theory, entropy is interpreted as average uncertainty or choice, €.9., the average uncertainty as to what symbol the source will produce next or the averagechoice the source has as to what symbol it will produce next. The entropy of statistical mechanics measures the uncertainty as to which of many possible states a physical system is actually in. EeurvocATroN: The uncertainty as to what symbols were transmitted when the received symbols are known. ERGoDrc:A source of text is ergodic if each ensemble aver&g€,taken over all messagesthe source can produce, is the same as the correspond- ing average taken over the length of a message.See Chapter III. FTLTER:An electrical network which attenuates sinusoidal signals of some frequencies more than it attenuates sinusoidal signals of other frequencies. A filter may transmit one band of frequencies and reject all other frequencies. FrNrrE-srATEMAcHTNE: A machine which has only a finite number of different states or conditions. A switch which can be set at any of ten positions is a very simple finite-state machine. A pointer which can be set at any of an infinite number of positions is not a finite- state machine. 290 GIossary FM: , representing the amplitude of a signal to be transmitted by the frequency of the wave which is transmitted. FORMANT: In speech sounds, there is much energy in a few ranges of frequency. Strong energy in a particular range of frequencies in a speech sound constitutes a formant. There are two or three principal formants in speech. FREQUENCY: The reciprocal of the period of a sine wave; the number of peaks per second. GALVANOMETER:A device used to detect or measure weak electric currents. GAUSSIANNOISE: Noise in which the chance that the intensity measured at any time has a certain value follows one very particular law. HypERcuBE: The multidimensional analog of a cube. HypERSpHERE:The multidimensio nal analog of a sphere. INDUCTOR:An electric device or circuit element made up of a coil of highly conducting wire, usually copper. The coil may be wound on a magnetic core. An inductor resists changes in elictric current. INPUr SIGNAL:The signal fed into a transmission system or other device. JoHNsoN NOISE:Electromagnetic noise emitted from hot bodies; thermal noise. 
JOULE:A measure or amount of energy or work. LATENCY:Interval of time between a stimulus and the response to it. LINE SPEED:The rate at which distinct, different current values can be transmitted over a telegraph circuit. LINEAR:An electric circuit or any system or device is linear if the response to the sum of two signals is the sum of the responses which would have been obtained had the signals been appliid separately. If the output of a device at a given time can be expressed as the sum of products of inputs at previous times and constants which depend only on remoteness in time, the device is necessarily linear. LINEARPREDICTION: Prediction of the future value of a signal by means of a linear device. MAP: To assign on one diagram a point corresponding to every point on another diagram. MAxwELL's DEMON:A hypothetical and impossible creature who, without expenditure of energY, can see a molecule coming in a gas which is all at one temperature and act on the basis of tt ir information. MEMoRv: The part of an electronic computer which stores or remembers numbers. MEssAGn:A string of symbols; an electric signal. MESSAGEsouRcn: A device or person which generates messages. Glossary 291

NEGATTvEFEEDBAcK: The use of the output of a device to change the input in such a way as to reduce the difference between the input and a prescribed input. NEGATTvEFEEDBACK AMeLIFIER: An amplifier in which negative feedback is used in order to make the output very nearly u constant times the input, despite imperfections in the tubes or transistors used in the amplifier. NETwonr: An of resistors, capacitors, and inductors. NorsE: Any undesired disturbance in a signaling system, such as, random electric currents in a telephone system. Noise is observed as static or hissing in radio receivers and as "snow" in TV. NorsETEMnERATURE: The temperature a body would have to have in order to emit Johnson noise of any intensity equal to the intensity of an observed or computed noise. NoNLTNEARpREDrcrroN: Prediction of the future value of a signal by means of a nonlinear device, that is, any device which is not linear. oRrcrN: The point at which the axes of a coordinate system intersect. ourpur srcNAL: The signal which comes out of a transmission system or device. pERroD: The time interval between two successivepeaks of a sine wave. pERIoDIc: Repeating exactly and regularly time after time. nERrETUALMorroN: Obtaining limitless mechanical energy or work con- trary to physical laws. Perpetual-motion machines of the first kind would generate energy without source. Perpetual-motion machines of the second kind would turn the unavailable energy of the heat of a body which is all at one temperature into ordered mechanical work or energy. rHASE: A measure of the time at which a sine wave reaches its greatest height. The phase angle between two sine waves of the same frequency is proportional to the fraction of the period separating their peak values. pHAsEsHrFT: Delay measured as a fraction of the period rather than as a time difference. pHAsE spACE: A multidimensional space in which the velocity and the position of each particle of a physical system is represented by distance parallel to a separate axis. rHoNEME:A classof allied speech sounds, the substitution of one of which for another in a word will not cause a change in meaning. The sounds of b and p are different phonemes, the substitution of one of which for another can change the meaning of a word. 292 Glossary POTENTIALTHEORY: The mathematical study of certain equations and their solutions. The results apply to gravitational fields, to ,..tain aspects of electric and magnetic fields, and to certain aspects of the flow of air and liquids. PowER: Rate of doing work or of expending energy. A watt is I joule per second. PROBABTLTTY:In mathematics, a number between 0 and I associatedwith an event. In applications this number is the fraction of times the event occurs in many independent repetitions of an experiment. E.g.. the probability that an ideal, unbiased coin will turn up headsis .5. QUANTUM:A small, discrete amount of energy and, especially, of electro- magnetic energy. THEORY: QUANTUM Physical theory that takes into account the fact that energy and other physical quantities are observed in discrete amounts. RADTATE:To emit electromagnetic waves. RADIATION: Electromagnetic waves emitted from a hot body (anything above absolute zero temperature). RANDoM: Unpredictable. REDUNDANT: A redundant signal contains detail not necessary to deter- mine the intent of the sender. If each digit of a number is sent twice (l I 0 0 I linsteadof I 0 l)thesignilormessageisredundant. 
REGISTER:In a computer, aspecial memory unit into which numbers to be operated on (to be added, for instance) are transferred. RELAY:An electrical device consisting of an electromagnet, a magnetic bar which moves when the electromagnet is enelgrzed, urrd pairs of contacts which open or close when the bar moves. RESISTOR:An electrical device or circuit element which may be a coil of fine poorly conducting wire, a thin film of poorly conducting material, such as carbon, or a rod of poorly cond,rciit g material. A resistor resists the flow of electric current. SAMPLE:The value or magnitude of a continuously varying signal at a particular specified time. SAMPLING THEOREM:A signal of band width W cps is perfectly specified or described by its exact values at 2W equally spaced-times per second. SERVOMECHANISM: A device which acts on the basisof information received to change the information which will be received in the future in accordance with a specific goal. A thermostat which measures the temperature of a room and controls the furnace to keep the tem- perature at a given value is a servomechanism. SIGN: In medicine' something which a physician can observe, such as an Glossary 293

elevated temperature. In linguistics, a pictograph or other imita- tive drawing. srcNAL: Any varying electric current deliberately transmitted by an elec- trical communication system. srNEwAvE: A smooth, never-ending rising and falling mathematical curve. A plot vs. time of the height of a crank attached to a shaft which rotates at a constant speedis a sine wave. sTNGLE-cuRRENTTELEGRArHv: Telegraphy in which use is made of two distinct conditions: no current and current flowing into or out of the wire. spAcE: A real or imagin ary re1ion in which the position of an object can be specifiedby means of some number of coordinates. srATroNARy:A machine, or process,or source of text is stationary, roughly, if its properties do not change with time. See Chapter III. srATrsrrcAl MEcHANrcs:Provides an explanation of the laws of thermo- dynamics in terms of the average motions of many particles or the average vibrations of a solid. srArrsrrcs: In mathematical theories,we can specify or assignprobabilities to various events. In judging the bias of an actual coin, we collect data as to how many times heads and tails turn up, and on the basis of these data we make a somewhat imperfect statistical estimate of the probability that heads will turn up. Statistics are estimates 'othe of probability on the basis of data. More loosely, statisticsof a message source" refers to all the probabilities which describe or characterrzethe source. srocHAsTrc: A machine or any process which has an output, such as letters or numbers, is stochasticif the output is in part dependent on truly random or unpredictable events. sroRE: Memory. suBJECr:A human animal on which psychological experiments are carried out. syMBoL:A letter, digit, or one of a group of agreed upon marks. Linguists distinguish a symbol, whose association with meaning or objects is arbitraty, from a sign, such as a pictograph of a waterfall. syMproM: In medicine, something that the physician can know only through the patient's testimon), such as, a headache, as opposed to a sign. sysrEM: In engineering, a collection of components or devices intended to perform some over-all functions, such as, a telephone switching system. In thermodynamics and statistical mechanics, a particular collection of material bodies and radiation which is under consider- ation, such as, the gas in a container. 294 Glossary rEssARAcr: The four-dimensional analog of a cube, a hypercube of four dimensions. THEoREM: A statement whose truth has been demonstrated by an argu- ment based on definitions and on assumptions which are taken to be true. THERMALNorsE: Johnson noise. THERMODYNAMICS:The branch of science dealing with the transformation of heat into mechanical work and related matters. rorAl ENERGY:The total energy of a signal is its average power times its duration. TRANSISTOR: An electronic device making use of electron flow in a solid, which can amplify signals and perform other functions. vACuuM TUBE: An electronic device making use of electron flow in a vacuum, which can amplify signals and perform other functions. vocoDrn: A speech transmission system in which a machine at the trans- mitting end produces a description of the speech; the speechitself is not transmitted, but the description is transmitted, and the description is used to control an artificial speaking machine at the receiving end which imitates the original speech. wATr: A power of I joule per second. WAVEGUIDE: A metal tube used to transmit and guide very short electro- magnetic waves. 
wHIrE NOISE:Noise in which all frequencies in a given band have equal powers. zIPF's LAw: An empirical rule that the number of occurrences of a word in a long stretch of text is the reciprocal of the order of frequency of occurrence. For example, the hundredth most frequent word occurs approximately | / 100 as many times as the moit frequent word. Index

Abbott's Flatland, 166 Automa ta, 209, 227; defined , 219, 287; Absolute zero, 188,292 examples ol 287 Acoustics, 126; continuous sources in, Averages, ensemble, 58-59, 60; time, 59; network theory in, 6 58-59, 60 Addresses, defined, 222, 287; illustrated, Averback,8.,249 223 Axes, defined, 287; in multidimensional -169 Aerodynamics, 20; potential theory in, 6 spaces, 167 Aiken, Howard ,220 Ayer, A. J., on importance of commu- Algebra, 278; Boolean, 221 nication, I Alphabet, defined, 2871'see also Letters of alphabet Babbitt, Milton, xi Ambiguity, in sentences, I l3-ll4 Band limited, defined,287 ; signals, 170- 4 Arnplification, by radio receivers, 188- t82,272-27 l9l Bandwidth, l3l, 173-175, 188*189, 1921'amount of information trans- Amplifiers, 294; broad-band and nar- missible over, 40, 44; channel row-band, 188; gain in,2l7; Maser, capacity and, 178; defined, 38, 287; l9l; negative feedback, 2lG2l8, power and, 178; represented by 2'27 amplitude, l7l Amplitudes, l3l; defined, 31,287; in Bands, defined 287; line speed and, 38 pulse code modulation, 132-133; in , Bell, Alexander Graham, 30 samples of band-limited signals, I 16, ll7, 17l-174; seealso Attenuation Berkeley,Bishop, ll9 Binary digits, 206; alternative number of Antennas, 184-185; in interplanetary patterns determined by, 71, 73-74; communication, 19? Approximation s, see Word approxima- contracted to "bitr" 98; computers tions ond, 222,224; defined, 287; encod- Arithmetic, as mathematical theory, 7, ine of text in, 74-75,7G77,78-80, 8; units in computers,223 83-86, 88-90, 94-98; errors in trans- mission of, 148-150, 157-163; not Ashby, G. Ross,218 necessarily same as "bitr" 98-100; Attenuation, defined, 3J, 287; in distor- stored in computers, 2D, 223; in tionless transmission, 289; by filters, transmrssion of speech, 148* I 50, 289; frequency and, 33-34; number 157-163: "tree of choice" of, 73- of current values and, 38 74.99


Binary systemof notation, decimalsys- Channelcapaci ty (Continued) temand ,69-70,72-73,76-77: ocial 156; measurementof, 164; with systemand, 7l, 73 messagesin two directions,275- Bit rate,defined, 100 276; of symmetrical ,.binary and unsym- Bits, 8, 66; as contraction of metricalbinary channels, 158-159, digit," 98; defined, 202, 287; as 164-165 measurement g0_96, of entropy, Channel vocoder, 138; defined, 288; 88-94, 98- 100; not necessarily illustrated, 137 sameas "binary digit,',98-100; iir Channels,capacity of, 97,98,106, 155- psychological experiments, 230_ 156,l5g-159, 164, 176, 275_276; 231;per quantum, lg1, 197 error-free,163; noisy, 107 , 145-165, Black,Harold,216 170*182,27 6; symmetricalbinary, Blake,William, ll7 t57*t59,t64*t65 Blockencoding, 77, 90, 177,182; check Checkdigits, 159-161, 165; defined,288 digitsin, 159-16 I ; defined,7 5 , 297; Checker-playingcomputers, 224, 225 error in, 149,156-157 Cherry,Colin, I l8 ,,distanceo, Blocks, defined, 75, 287; Chess-playingcomput ers, 224, 225 between, 16l-162; entropy and, Choice,in finite-statemachines, 54-56; 90-93, 94, 97 Huffman codeand, in langu&Be, 253-254; in message 97, l0l ; length ofl l0l - t03, 127 sources,62, 79-80, 8l; see also Bodies,forces or, 2-3; hot, 186, 207, Bits,Freedom of choice 290,292;in motio\, 2 Chomsky,Noam, ll2-115, 260;Syntac- Bolitho,Douglas,259 tic Structures,I l3n. Boltzmann,209 Ciphers, 64, 27l-272 Boltzmann'sconstant, 188, 202; defined, Circuits,accurate transmission by, 43; 287 contactsin, 288;linear, 33, 43-44; Booleanalgebra,22l *Breakifrg," relay,220, 221; undersea, 25 in modulationsystems, l8l Classical,defined, 288 Breakthrough,140 Codes,in cryptography,64, l 18, Z7l- Bricker,P. D., xi 272;erroi-coirecting, 159- 163, 165, Brooks,F.B.Jr.,259 276;Huffman, 94-97, 99, 100, l0l, Brownian motion,29; defined, 185, 288 105;in telegraphy,24-29; see also Encoding,Morse code Coding,see Encoding Communication,aim of, 79; as encod- C language,224 ing of messages,78; interplanetary , Cables,linearity of, 33; insulation of, 29; 196-197;quantum effects and, 192- transatlantic, 26, 29, l3g, 143; 196;see also Language voltagein,29 Communication theory (Information Cage,John, 255 theory),18, I 26,268-269; art and, Campbell,G. A., 30 250-267;ergodic sources and, 60- Capacitors,33; defined,5,2gg 61,63; as generaltheory, 8-9; as Capacity,channel, 97 ,98, 106,155-156, mathematicaltheory, ix-x, 9, 18, l5g-159, 164, 176, 275_276;de_ 60-6I , 63,27 8; multidirnensional fined,288; information, 97 geometryin, 170,l8l, 183;origins Carnot,N. L. S.,20 of, I , 20-441physics and, 24, I9B Channelcapacity, 107; band width and, psychologyand, 229-249;useful- 178; for continuous channel plus nessof,8-9,269 noise, 176-177;defined, 97, 106, Companding,defined, 132 164; entropy less than, 9g, 106; Compilers,224 errors in transmissionand, 155_ Complicatedmachine s, seeAutomata Index 297

Computers, 66, 209, 219, 287; cores in, Discretesignals, theory of (Fundamental 222,288; Datatron, 259; decisions theorem of the noiselesschannel), of, 221, as finite-state machines, 98, 106, 150-159,203 62; grammar for" I l5 ; literary Discrete sources, 66, 67: defined, 59, work by, 264; memories (stores)ol 289: noise and, 150-I 56 221-222, 290 music by, 225, 250, Distance,between blocks in code, l6l- 253, 259-261; prediction by, 210- r62 212; programming, 221-225; relay, Distortionless, defined, 34, 289 22V221; transistor, 220; "under- DNA (Desoxyribonucleic acid), 65 standing" in, 724; uses of., 224-226; Double-current telegraphy, defined, 27,, vacuum tube, 220; visual arts and, 289 2&-266 Dudley, Homer, 136 Contacts,defined. 288; in relavs,?92 Dunn, Donald A.,263 Continuous signals,encoding of, 66-68, 78, l3l-143,276; entropyof, l3l; Eckert,J.P.,220 frequency of, 67, 13I ; noise and, Edison, Thomas, quadruplex telegraphy 170-182,276; theory of, 170-182, of ,,27 -29, 38 203 Efficient encodin g, 7 5-76, 77, 1A4.,106, Coordinates, 167-169;defined, 288 146-147, 276: by block encoding, Cores, rnagnetic, addresses in, 222; l0l*103, 156-157;for continuous defined, 288 signals, l3 I - 143,,27 6: by Huffman Costello,F. M., xi code, 94-98, l0 I , 105; in Morse Cpr, defined, 3 1, 288 code, 43, 13I ; number of binary Cryptography, theory of , 27 l-272; see digits needed in, 78-80, 88; prin- also Codes ciples of, I 42; for TV transmission, Currents,electric, detection of ,27,290: l3 I, 139-143, 27 6; for voice trans- induced by magnetic field,29n., Max- mission, l3l ; seealso Encoding well on, 5n.; in telephony, 30; values Effort, economy of, in language, 238- ol 28, 36-38, 44; see also Double- 239, 242 current telegraphy, Noise, Signals, Einstein, Albert,, 145, 167; on Brownian Single-currenttelegraphy motion, 185; on Newton, 269 Cybernetics, 4l; described, 208-210, Electromagnetic waves, 5n.; defined, 226-227; etymology of, 208 289; generation of, 185-186; on Cycles,Carnot, 20; defined, 31, 288 wires, 187-188 Electronic digital computers, see Com- puters Datatron computer,259 Encoding, 8, 76,78; "best" way of,,76, Decimal systemof notation, 69, 72-73 77,, 78-79; into binary notation, Delay, defined,33, 288; in distortionless 74-75,76-77, 78-80, 83-86, 88-90, transmission, 289; see also Phase 94-98; channel capacity and , 97- shift 98; of continuous signals, 66-68, Detection theory, 208, 215:'defined, 288 78, l3l-143, 276: cryptographic, Dielectrics, in capacitors, 5, 288 64-65; dangers in, 143-144; of Digram probabilities,51, 56, 57, 92-93; English text, 56, 74-75, 76-80, 88, defined,50, 288 94-98, l0l-106, 127-129;in FM, Dimensions, defined, 288; four, 166- 65, 178-I 79: of geneticinformation, 167; infinity of, 167-170; phase 64-65; Hagelbarger's method of, spacesas, 167; two, 166; up-down, 162; by Huffman code, 94-97, 100, east-west,north-south, 288; seealso 105, 128; into Morse code, 24- One-to-one mapping 27, 65; of nonstationary sources, Diodes, defined,288 59n.; noiseand, 42,44, 144,276: 298 Index

Encodin g (Continued) Ergodic sources,defined, 59,63; English by patterns of pulses and spaces, writers &s, 63; as mathematical 68-69, 78:' physical phenomena models, 59-61, 63, 79, 172; proba- and, 184; quantum uncertainty and, bility in, 90, l7Z t94-197; of speech, 65, t3t-139, Errors, correction of, 159- 163, 165; 276; in telephory, 65, 27 6-277; word detection of, 149-150; in noisy by word, 93, 143; - see also Block channels, , 147 165; reduction encoding, Blocks, Efficient encoding through redundancy ol 149-150, Encrypting of signals, 276 163, 164-165; in transmission of Energy, 17l; electromagnetic, 185; as binary digits, 148- I 50, lS'7-163; formants, 290; free, 202, 204-206, see also Check digits, Noise 207; of mechanical motion, 185; Exponents, defined, 284 organized, 2A2; ratio, in signal and noise samples, 175-179, 182; ther- Fano, 94 mal and mechanical, 2A2; seealso Feedback, see Negative feedback Quanta, Total energy Fidelity criterion, l3l , 274 Energy level, defined,203,289 Filtering, 208, 210, 215 English text, encoding of, 56, 7 4-75 , Filters, 227; defined, 41,289; in smooth- gg, g4_gg, 76-90, l0l_106, 127_ ing,215; in vocoders,136-138 129; see also Letters of alphabet, Finite-state machines, defined, 289; de- Word approximations, Words scription of, 54-56; electronic digi- Eniac,220 tal computers as,62; entropy of,93- Ensembles,41, 42; defined, 57,289 94, I 53; grammar and, I 03- I 04, Entropy, l05i per block, 91, 106; chan- I I l-112, ll4-l l5; illustrated, 55; nel capacity and. 98, 106; in com- men as,62, 103; randomness in,62 munication theory, 23-24, 80,2A2, Flatland, 166 206,207; conditional, 153; of con- Flicker, in motion pictures and TV, l4l tinuous signals, l3l; defined,66, FM, "breaking" in, l8l ; defined, 290; 80,202,289 estimatesof, 91, 106; encoding in, 65, 178-l7g of finite-state machines,93-9 4, 153: Foot-pound, l7l formulas for, 81, 84, 85; grammar Formants, defined,290 and, I 10, I l5; highesrpossible ,97, Fortran, 224 106; Huffman code and,94-98: Fourier, Joseph, analyzes sine waves, human behavior and, 229; of 30-34, 43*44 idealized gas, 203-205i per letter of Freedom of choice, 48; entropy and, 8 I , English text, l0l-103, lll, 130; see also Choice measured in "bits,'o80-81 ; message Frequency, l3 I ; defined, 31, Z9A; of in- sources and, 23, 8l-85 , BB-94, 105, put and output sine waves, 33; of 206; in model of noisy communica- letters in English, 47-54, 56, 63; tion system, 152-155; per note of quantum effects and, 196; in speech, music, 258-259; in physics (statisti- I 35- 138; of TV signals, 67; of voice cal mechanics and thermodynam- signals, 67; of white noise, 173 ics),2l-23, 80, 198,202,206,207; Frequency modulation, see FM -86; probability and, 8 I reversibility Friedman, William F., 48 and, 2l-22; of speech, 139; per Fundamental theorem of the noiseless symbol ,91,94, 105 channel, 106; analogous to quan- Equivocation, defined, 154, 289: in tum theory, 203; reasoning ol 150- enc-iphered message,272; entropy 159, 163, 164, 1651'stated, 98, 156 and, I 55, 164; in symmetrical binary channels, 155, 164 Gabor, Dennis, "Theory of Communi- Ergodic, defined, 289 cation," 43 Index 299

Gain, in amplifiers, 217-218 Hydrodynamics, 20 Galvanometers, defined, 27, 290 Hyman, Ruy, experiment on informa- Gambling, information rate and, 270- tion rate by, 230, 232, 234 27r Hypercubes, defined, 290 Games,as illustrationsof theoremsand Hyperquantization, defined, 132, 142; proofs,l0-14; playedby compu- effects of errors in, 149 ters,224, 225,226 Hyperspace, 170; amplitudes as coordi- Gas, 62, 293; as exampleof physical nates of point in, 172-175 system,203-206; ideal expansion Hyperspheres,defined, 290; volume of, of, 20n.; motion of moleculesin, 170, 173 185 Gaussiannoise, 177, 276; defined,173- Ideas,general, I l9-120 174,276; monotony of, 251;same Imaginary numbers, 170 as Johnsonnoise, 192 Inductors, 33; defined 5, 290 ,information , theoryand, 64-65 Infinite sets, l6 Geometry,4; computersand ,224,225 Information, definitions of, 24; as han- Euclidean,7 l4; multidimensional, , dled by man, 234; learning and, 166-170; in problemsof continuous 230-237, 248-249; measurement of signals,l8l, 182 amount of, 80; uncertainty and, Gilbert,E. N., xi 24; per word,254 Golay,Marcel J. 8., 159 lnformation capacity, 97 Governors,as servomechanisms,209, Information rate, 9l in man, 230-237 2t5-216 ; , 238-239,250-251 see also Entropy Grammar, 253-254;in Chomsky,ll2- Information theory, see Communication I l5; entropyand, I10, I l5; finite- theory statemachines and, 103-104, I I l- Input signals, defined, 32,290 ll2, ll4-l 15; meaningand, ll4- Institute of Radio Engineers, I I16, 118; phrase-structure,I l5; Insulation, of cables, 29; rn capacitors, rulesof, 109-ll0 5, 288 Guilbaud,G. T., 52 Interference, current values and, 38; in- tersymb ol, 29; see also Noise Hagelbarger,D. W., 162 Intersymbol interference, defiiied, 29 Hamming,R. W., 159 lsaacson, L. M ,, 259 Harmon,L. D., 120 Hartley,R.V. L., 42; "Transmissionof lnformation,"39-40, 17 6 Jenkins, H. M., xi Heat,motion and, 185-186; waves, 186, Johnson,J. B., 188 187;see also Temperature, Thermo- Johnson noise (Thermal noise), 196-199; dynamics defined, 290; formulas for, 188- 189, Heaviside,Oliver, 30 195; same as Gaussian noise, 192; Heisenberg'suncertainty prin ciple, I94 as standard for measurement, 190 Heftz, Heinrich Rudolf, 5 Joule, defined, 17l, 290 Hex (a game),l0- I 3 Julesz, Bela, 264 Hiller,L. A. Jr.,259 Homeostasis,defined, 2 I 8 Hopkins,A. L. Jr.,259 Karlin, J. 8., 232 Horsepower,defined, l7l Kelly, J. L. Jr.,270 Hot bodies,186,207, 290, 292 Kelvin, Lord, 30 Howes,D. H.,240 Kelvin degrees,defined, 188 Huffmancode, 94-97,99,l0l, 105,128 Klein, Martha,259 prefix propertyin, 100 Kolmogoroff, A. N., 41,42, 44,214 300 Index

Language,basic, 242; choice in, 253- Mandelbrot, Benoit, xr, 240, 246-247 254; classificationin, 122-123;as Map, defined, 290 codeof communication,118; de- Mapping, continuous, 16, 179; one-to- fined, 253; economy of effort in, one, 14*15,16-17, 179 238-239,,242;emotions and, l l6- Mars, transmission from vicinity of, ll7; euphonyin, I 16-ll7; every- t92*t94 day vs. scientific,3-4,l2l; experi- Maser amplifier, 19l, 194 ence as basis for, 123, 242, 245; Mathematical models, function of, 45- humanbrain and, 242,245;mean- 47; ergodic stationary sources &s, ing in, I l4- 124;random and non- 59-61, 63, 79, 172; in nerwork random factors in, 246;understand- theory, 46; to produce text, 56 ing and, ll7, 123-124;see also Mathematics, intuitionist, l6; new tools 's Grammar, Words,Zipf law in, 182: notation in, 278-279; po- Latency, defined, 290; as proportional tential theory in, 6; purpose of, l7; to information conveyed, 230-232 theorems and proofs in, 9-17; tn Learning,experiments tn, 230*237,248- words and sentences,278-279 249;by machines,261 Mathews, M. Y., xt, 253 's Leasteffort, Zrpf law and,238-239 Mauchly , J. W.,220 Lettersof alphabet,binary digit encod- Maxwell, James Clerk, 198, 209; Elec- ing of, 74-75, 78, 128-129;entropy tricity and Magnetism, 5n., 20; p€r, l0l-103, I I l, 130;frequency equations of, 5-6, l8 of,47-54, 56,63;predictability ol Maxwell's demon, 198-200; defined, 47-52;sequences of, 49-54,63,79; 290; illustrated, 199 suppressionof, 48 Meaning, language and , ll4-124 Licklider,J. C. R., 232 Memories, 221-222; addressesin, 222, Line speed,defined, 38, 290 287; defined, 290 Linear prediction, 2ll-212, 227; de- Memory, in man, 248-249 fined,290 Merkel, J., 230 Linearity, defined, 32-33, 290; of cir- Message sources,8; basis of knowledge cuits,33, 43-44 of, 6l ; choice in, 62, 79-80, 8l ; de- Liquids,motion of moleculesin, 185 fined, 206, 290; discrete, 59, 66, 67, Logarithms,bases of, 285; explained, 150-156,289;entropy and, 23,81- 36, 82, 285-286;Nyquist's relation 85, 88-94, 105,206; ergodic, 57-61, and,36-38 63, 79, 90, 172; intermittent, 9l; natural structure of, 128; nonsta- tionary, 59n.; simultaneous, 80, Machines,desk calculating, 222; finite- 9l-92; station ar! , 57 -59, 172-173 state, 54-56, 62, 93-94, 103-1A4, statistics of, 293; as "tossed" coins, I I I - ll2, ll4- I 15, 153,289; learn- 8l*85; seealso Signals ing by, 261; pattern-matching,120; Messages, defined, 290; increase and perpetual-motion, 198-202, 206, decreaseof number of, 8l. see also 2911,translation, 54, 123,225;Tur- Signals ing, ll4; seealso Computers Miller, George A.,248 Man, behavior of, 126, 225; emotions Minsky, Marvin,226 in, I l6- ll7:. entropy and, 229;as Modes, 195 finite-statemachine, 62, 103;infor- Modulation, improved systems of, 179, mation rate rn, 230-237,238-239, 277; noise and, 180-l8l ; seealso 250-251; memory in, 248-249; as FM, Pulse code modulation messagesource, 61, 103-104;nega- Molecules, Maxwell's demon and, 198- tive feedbacktn,2l8, 227: seealso 200; motion of, 185-186 Language Morse, Samuel F. B., 24-25,42,43 Index 301

Morse code, 24-25, 40, 65,97, 129, l3l: Noise (Continued) limitations on speed rn,25-27 telegraphy,38, 147; in telephony, Motion, Aristotle on, 2; Brownian,29, 145, 162, 291; in teletypewriter I 85, 288; energy of, I 85; of mole- transmission, 146: see also Gaus- cules, 185-186; Newton's laws of, sian noise, Johnson noise, White 2-3,4-5,g, 19,20,203 noise Mowbray,G. H.,231 Noise figure, formula for, l9l Multiplex transmission, defined, 132 Noise temperature, defined, 291; as Music, 65,66; appreciationof ,251-252; measure of noisiness of receiver, composition of, 250-253; electronic r90-r92 compositionof , 224, 225,250,253, Noiselesschannel, theorem of, 98, 106, 259-261; Janet compiler in, 224; 150-159,163, I 64, 165,203 rules of , 254-2551'statistical com- Nonlinear prediction, 213-214; defined, position of, 255-259 29r Numbers, imaginary, 170; see also Nash, see Hex Binary digits, Binary system of no- Negative feedback, 209,227; in animal tation organisms, 2 I 8; defined, 215,,291; Nyquist, Harry, "Certain Factors Af- as element of nervous control, 219, fecting Telegraph Speed," 35-39, 227; linear, 216; nonlinear, 216: 176, 182; "Certain Topics in Tele- unstability in, 216,227;uses of, 218 graph Transmission Theory," 39; Negative feedback amplifiers, 21 6-218, on Johnson noise, 188 227; defined, 291; illustrat ed, 217 ,as finite-statemachine, Off-on signals, encoding by, 68-77 ; see 62; negative feedback and, 219,227 also Single-current telegraphy Network theory, 8, l8; in acoustics,6; One-molecule heat engine, 200-201, 204 defined, 5-6; generality of, 6-7; One-to-onemapping, 14-15, l6-17, 179 linearity in, 33; mathematical Origin, defined, 291; see also Points models in, 46 Output signals,defined, 32,291 Networks, 5; defined, 5,291; as filters, 289; simultaneous transmission of Parabolic reflectors, 189 messagesover, 275-276 Particles, indistinguishability of, 7; Neumantr, P. G., 259 energy level ol 289 Newman, James R., ix, xi; World of Period, defined, 31,291 Mathematics, x Periodic, defined, 31, 291 Newton, fsaac, 14, 269; laws of, 2-3, Perpetual motion, defined, 291; ma- 4-5,8, lg, 20, 145,203 chines, 198-202,206 Noise, 43, 207; attempts to overcome, Phase,defined, 31, 291 29, 146-165, 170-182; "breaking" Phase angle, defined, 291 to, l8l ; causes of, 184-185; con- Phase shift, defined, 33, 291; see also tinuous signals and, 170- I 82; in Delay discrete communication systems, Phase space, 167; defined, 291 150-156; electromagnetic,188- 190; Philosophy, Greek, 125; as reassurance, encoding and, 42, 44, 144,276; ex- l17 traction of signals from, 42, 44; Phoneme, defined, 291; vocoders, 138 number of current values and, 38; Phrase-structuregrammar, I l5 power required with, 192; in radar, Physics, 125-126; communication 4l-42,213-214; in radio, 145, 184- theory and, 24, 198; theoretrcal, 4 I 85; I 88- l9 I , 207, 291; from ran- Pinkerton, Richard C., 258 dom error in samples,l3l-132; as Pitch, in speech,135 "snow" in TV, 144, 146,,291; in Plosives,135-136 302 Index

Poe, Edgar Allen, The Gold Bug, 64; Quanta, defined, 292; energy of, 196; rhythm of, I 16*117 transmission of bits and,' lg7 Poincard, Henri, 30 Quantization, defined, 68,, 132; in TV Points, in multidimensional fpace, 167- signals, 142; see also Hyperquanti- 169, 172, 179, l8l, 182: see also zation Origin Quantum theory, 125*126, 203; defined, Pollack, H. O. ,273-274 292; energy level in, 289; unpredic- Potential theory, 6; defined,292 tability of signals in, 196 Power, average, 172-173; band width Quastler,H.,232 and, 178; concentrated in short pulses, 177; defined , 292; as meas- Radar, noise in, 4l-42,, 213-214;predic- ure of strength of signal and noise, tion by ,210-2 14,227 17l; noise, 175-179, 182, I 89, 192_ Radiate, defined, 186, 292 193; ratio (signal power to noise Radiation, defined, 186, 292; equilib- power), 175-179, 182, 19l ; receiver, rium of, 187; rates of, 186 seealso 193; signal, 173-179, 182, 189, 192; Quanta transmitter, 193 Radiators, good and poor, 186 Predictability, of chance even ts, 46-47; Radio,noise in. 145,184-185, 188-191, of letters of alphabet, 47-52, 56, 207, 291 283; randomness and, 6l-62: see Radio telescopes,I 89- 190 also Probabilitv Radio waves, as encoding of sounds of Prediction, linear,- 2ll-212, 227,, 290; speech, 76; see also Electromag- nonline ar,213-214, 291; from radar netic waves data,2rc-214,227 Random, defined,292 Prefix property, defined, 100 Ratio, of signal power to noise power, Probability, 293 conditional, 5 l, 281: 175-179,lg2, lgl -86; defined, 292; entropy and, 8 I Reading speed, 232-237 , 238-Z4l ; rec- of letters in English tex t, 47-52, 56, ognition and,24A 283; in model of noisy communica- Receivers,efficient, 184-185; measure- tions system,l5 I - 153; of next word ment of performance of, l9O-192 or letter in message,6l -62; nota- Recognition, and reading speed,240 tion for, 28A-284; in word order, Redundancy, 144; defined, 39,,143, 292; 86*87; see also Digram probabili- as means of reducing error, l4g- ties, Predictability, Tiigram proba- 150,163.164-165 bilities Registers,222; defined, 292 Prtrgrams (for computers), 62, 221-225; Relays, in circuits, 220, 221; defined, decisions in, 221; compilers for, 292; malfunction of, 147 224 Registers, 223; defined,292 Prolate spheroidal functions, 274 189; as source of noise power, 189 Proofs, by computers, 224, 225; con- Response time, see Latency structive, 13, I 5- l6; illustrated by Reversibility, indicated by entropy , Zl- "hex," l0-13; illustrated by map- 22 ping, 14-17:' nature of, 9-17 Rhoades,M. V., 231 Psychology, communication theory and, Riesz, R. R., 240 229-249 RNA (Ribonucleic acid), 65 Pulsecode modulation, 132, 138, 142, Runyon, J. P., xi 147,276;in musicalcomposition, 250 Samples, 78; of band-limited signals, Pupin,Michael, 30 l7l-175, 272-274; defined, 292; fidelity criterion and, l3 I ; interval between, 66-68; random errors in, Quadruplextelegraphy, 27 -29, 38 t3t-t32 Index 303

Sampling theorem, defined, 66-67; Sources, see Message sources problemsin, 272-274 Space,defined,293; dimensions in, 166- Scanning,h TV, 140 170;Euclidean, I 69; function, l8l ; Science,history of, 19-21; informed phase,167 ,291; signal,l8 I ignorance in, 108; meaning of Spaceships, transmissionfrom, l9Ll97 words in, 3-4, 18, l2l Speech,encoding ol 65, l3 I - I 39,276; Secrecy,through encrypting of signals, entropy of, 139; frequenciesin, 276; seealso Codes 135-138;pitch in, 135,136; recog- Sentences,ambiguous, I 13-ll4; genera- nition of, by computers,224,225; tion of, I l2-ll5 samplesin representationoe 67- Servomechanisms,209; defined, 215, 292 68; voicedand voiceless,135-136; Shannon,Claude E., xi, 43, 87-89,93, seealso Language,Words 94,98,l2g-130, 146, 147,221,254, Speed,line, 38; reading,232-237,238- 270, 274, 276; "Communication 241;of transmission, 24-27 ,36-38, Theory of SecrecySystems,n' 27l- 44,130-131, 155-l 56, 164 272; estimateof entropy per letter Sperling, G., 249 of English text by, t0 i -i Of, I I l, Stadler,Maximilian, 255 130; fundamentaltheorem of the Stars,radio noisefrom, 190 noiselesschannel of, 98, 106,150- States,203-206, 207; defined, 203; see 159, 163, 164, 165, 203; "Mathe- alsoFinite-state machines matical Theory of Communica- Stationary,defined, 57-59, 293 tion," ix, I ,9, 4l-42; useof multi- Statistical mechanics, defined, 293; dimensionalgeometry by, 170,173 entropy in, 22, 202, 206, 207 Shannotr,M. E. (Betty),255,266 Statistics,defined, 293; of a message Shepard,R. N., xi source,293 Signals,analogo 140; band-limited, 170- stibitz,G. R., 220,221 182,272-274; broad-band, 143; Stochastic,60, 63; defined, 56, 293; choiceof "best" sort of, 42; con- music, 255-259 tinuous,66-68, 78, l3 l- 143,170- Stores,221-222; define d, 293 182,2A3, 276; defined, 196, 293; Subject,defined, 230, 293 demodulated,l8l ; discrete,66, 67, Summationsign, explained, 282-284 98,106, 150-159, 163,164,165, 203; Switching systems,219-220, 224, 227, encryptingof, 276; energyof, 172- 287 175, 182, 294; faint, 214; future Syllables,reading speedand, 234-236 valuesof ,209; input and output,32, Symbols,as current values,28, 36-38; 290-291;as points in multidimen- defined, 293; probability of occur- sionalspace, 172, 179, l8l, 182;in renceof, 88-94; primary and sec- quantum theory, 196;representa- ondary, 40; selectionof, 39-40 tion oq by samples,66-68; unpre- Symptoms,I 19; defined,293 dictability of, 196;on wires, 187- Systems,202-203; defined,293; deter- 188;see also Messages, Modulation, ministic, 46-47; switching, 219-220, Redundancy,Samples 224,227 Signs,I 19; defined,292-293 Systemsof notation, binary, 69-77; Sinewaves, amplitude of, 31,287;cycles decimal, 69-70; octal,7 | in, 31, 288; defined, 31, 293 Szilard,L.,21, 198 Fourier'sanalysis of, 30-34, 43-44; period of, 3 I , 291; see also Fre- quency Telegraphy,noise in, 38, 147;Nyquist Single-currenttelegraphy (otr-on teleg- oo, 35-39, 176, 182; as origin of raphy), defined, 27, 293 theory of communication, 20; Slepian,David, xi, I 62, 214,256, 274 quadruplex, 27-29, 38; speed of, Smoothing,208, 210, filters in, 215 25-27, 36-38; and telephonyon 304 Index

Telephony, 126; continuous sources in, 59; currents in, 30; encoding in, 65, 276-277; intercontinental relay in, 277; military, 132; multiplex transmission in, 132; negative feedback in, 218; noise in, 145, 162, 291; rates of transmission in, 130-131; switching systems in, 219-220, 224, 227, 287; and telegraphy on same wire, 38
Telescopes, radio, 189-190
Teletypes, errors in type-setting, 149; noise in, 146
Television, approximation of picture signal in, 141; color, 140, 142; efficient encoding in, 139-143, 276; errors in transmission over, 132; intercontinental relay in, 277; interplanetary, 194; quantization in, 132; rates of transmission over, 130-131; scanning in, 140; "snow" in, 144, 146, 291
Temperature, in degrees Kelvin, 188; radiation and, 186-187; radio noise power and, 189-190; of sun, 190
Tessaracts, defined, 167, 294
Theorems, defined, 294; illustrated by "hex," 10-13; illustrated by mapping, 14-17; nature of, 9-17; proven by computers, 124, 126, 224-225
Theory, classification of, 7-8; function of, 4, 18, 125; general vs. narrow, 4-7; physical vs. mathematical, 6-8; potential, 6, 292; of relativity, 145; unified field, 5; see also Communication theory, Network theory, Quantum theory
Thermal equilibrium, 199-200
Thermal noise, see Johnson noise
Thermodynamics, 19, 293; defined, 294; entropy in, 21-23, 198, 202, 207; laws of, 198-199, 206, 207
Thermostats, 216, 227, 292
"Thinking," by computers, 226
Thomson, William (Lord Kelvin), 30
Tinguely, Jean, 266
Total energy, free energy in, 202; of signal, 172-175, 294
Transistors, 33; in computers, 220; defined, 294; malfunction of, 147
Translating machines, 54, 123, 225
Transmission, accurate, 127; of continuous signals in the presence of noise, 170-182; distortionless, 34, 289; error-free, 163-164; length of time of, 40, 175; microwave, 277; of Morse code, 24-27; multiplex, 132; "predictors" in, 129-130; rates of, 24-27, 36-38, 44, 130-131, 155-156, 164; of samples, 131-132; of TV signals, 139-143, 194; in two directions, 33, 275-276; from vicinity of Mars, 192-194
"Tree of choice," binary digits and, 73-74, 99
Trigram probabilities, 51-52, 93
Tuller, W. G., 43
Turing, A. M., 226
Turing machines, 114
Uncertainty, Heisenberg's principle of, 194; information as, 24; of message received, 79-80, 105, 164; see also Entropy
Understanding, language and, 117, 123-124
Unstability of feedback systems, 216, 227
Vacuum tubes, 33; in computers, 220; defined, 294; malfunction of, 147
Visual arts, appreciation of, 266-267; computers and, 264-267
Viterbi, Andrew J., 192
Vocabulary, size and complexity of, 242-245
Vocoders, defined, 294; described, 136-142; illustrated, 137; types of, 138
Voice, see Speech
Voltage, in cables, 29; in detection of faint signals, 214-215
Volume, in multidimensional geometry, 169-173
Von Neumann, John, 221-222
Watt, defined, 171, 294
Waveguides, 195, 294
Waves, heat, 186, 187; light, 186, 187, 196; radio, 76; of speech sounds, 133-135; see also Electromagnetic waves, Sine waves

White noise, defined, 173, 294; monotony of, 251
Wiener, Norbert, 41, 44, 227; Cybernetics, 41, 208, 219; I Am a Mathematician, 209, 210
Wolman, Eric, xi
Word approximations, to English text, 48-54, 86, 90, 110-111, 246, 261-264; first-order, 49, 53, 86, 246, 261; fourth-order, 110-111; to Latin, 52; second-order, 51, 53, 90, 262; third-order, 51-52, 90; zero-order, 49
Word-by-word encoding, 93, 143
Words, associations to, 118-119; binary-digit encoding of, 75, 77, 78, 86-88, 129; determined by qualities, 119-121; sequences of, 53-56, 79, 110-111, 122; as used in science, 3-4, 18, 121; see also Language, Zipf's law
Wright, W. V., 259
Zipf's law, 238-239; defined, 294; illustrated, 87, 243; other data and, 247

About the Author

Dr. John R. Pierce was born in Des Moines, Iowa, in 1910 and spent his early life in the Midwest. He received his undergraduate education at the California Institute of Technology in Pasadena, and his M.A. and Ph.D. degrees from the same institution. In 1936 he went to the Bell Telephone Laboratories, and was Executive Director, Communication Sciences Division, when he left in 1971 to become Professor of Engineering at the California Institute of Technology. He is now Chief Technologist at Caltech's Jet Propulsion Laboratory.

Dr. Pierce's writings have appeared in Scientific American, The Atlantic Monthly, Coronet, and several science fiction magazines. His books include Man's World of Sound (with E. E. David); Electrons, Waves and Messages; Waves and the Ear (with W. A. van Bergeijk and E. E. David); and several technical books, including Traveling Wave Tubes, Theory and Design of Electron Beams, Almost All About Waves and An Introduction to Communication Science and Systems (with E. C. Posner).

Dr. Pierce is a member of several scientific societies, including the National Academy of Sciences, the National Academy of Engineering, the American Academy of Arts and Sciences, and the Royal Academy of Sciences (Sweden), and is a Fellow of the Institute of Electrical and Electronics Engineers, the American Physical Society and the Acoustical Society of America. He has received ten honorary degrees and a number of awards, including the National Medal of Science, the Edison Medal and the Medal of Honor (IEEE), the Founders Award (National Academy of Engineering), the Cedergren Medal (Sweden) and the Valdemar Poulsen Medal (Denmark).