SETL for Internet Data Processing
by David Bacon

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Computer Science
New York University
January, 2000

Jacob T. Schwartz (Dissertation Advisor)

© David Bacon, 1999

Permission to reproduce this work in whole or in part for non-commercial purposes is hereby granted, provided that this notice and the reference http://www.cs.nyu.edu/bacon/phd-thesis/ remain prominently attached to the copied text. Excerpts less than one PostScript page long may be quoted without the requirement to include this notice, but must attach a bibliographic citation that mentions the author’s name, the title and year of this dissertation, and New York University.

For my children

Acknowledgments

First of all, I would like to thank my advisor, Jack Schwartz, for his support and encouragement. I am also grateful to Ed Schonberg and Robert Dewar for many interesting and helpful discussions, particularly during my early days at NYU. Terry Boult (of Lehigh University) and Richard Wallace have contributed materially to my later work on SETL through grants from the NSF and from ARPA. Finally, I am indebted to my parents, who gave me the strength and will to bring this labor of love to what I hope will be a propitious beginning.

Preface

Colin Broughton, a colleague in Edmonton, Canada, first made me aware of SETL in 1980, when he saw the heavy use I was making of associative tables in SPITBOL for data processing in a protein X-ray crystallography laboratory. Accordingly, he loaned me B.J. Mailloux’s copy of A SETLB Primer by Henry Mullish and Max Goldstein (1973). I must have spoken often of this language, because when another colleague, Mark Israel, visited the University of British Columbia two or three years later, and came across a tutorial entitled The SETL Programming Language by Robert Dewar (1979), he photocopied it in its entirety for me.
The syntactic treatment of maps in SETL places its expressive balance closer to algebraic mathematics than is customary for programming languages, and I immediately started finding SETL helpful as a notation for planning out the more difficult parts of programs destined to be coded in SPITBOL, Algol 68, Fortran, or Assembler/360. When I took my M.Sc. in 1984–5, my enthusiasm about SETL was such that I made a presentation about it to my Advanced Programming Languages class. Later, when I was invited to stay on for my Ph.D. at the University of Toronto, the reason I declined to do so was specifically because no one there was willing to supervise work on SETL, and I for my part did not want to get caught up in either the logic programming or the (really very good) systems programming tradition prevailing there at that time.

So it was that in 1987, not having access to any good SETL implementation (the CIMS version seemed to crash on the first garbage collection attempt on the mainframe I was using), I decided to dash one off in SPITBOL. It was a good learning exercise. The compiler seemed to work correctly to the extent it was tested, and produced runnable SPITBOL code, but never saw much practical application on the hardware of the day.

In December of 1988, I decided to write a production-grade SETL compiler in C, and thus began an implementation that I use to this day and continue to extend. From the start, I found maps useful in the combinatorial programming I was doing in molecular modeling. The implementation is packaged in such a way that its most common invocation mode (“compile and go”) is via the shell-level command setl, making SETL programs easy to fit into Unix filter pipelines, just like other scripts or pre-compiled programs.
Reasonably convenient though it was to be able to set up arrangements of communicating SETL programs in this way, languages such as the Bourne and C shells, used for interconnecting the various programs, were still nothing more than thin descendants of the job control languages of yore. Since SETL was already so competent at handling data, it was only natural to extend it with facilities for process creation and communication. This led to the one-shot filter, unidirectional pipe, and bidirectional pump models for communication described in Chapter 2 of this dissertation.

The fact that a writea to a reader’s reada can as easily transfer a large map as a small integer, combined with the fact that the Unix command rsh can be used to launch tasks on remote processors and communicate with them, made this a very comfortable environment for distributed programming, though it could not express a TCP/IP (TCP or UDP) server or client directly.

In 1994, the World Wide Web “arrived”, and it became clear that TCP/IP was to become firmly established as a global standard: the MIME-conveying HTTP protocol rides upon TCP streams, and the namespace-structuring Universal Resource Locator (URL) convention uses host names that map to IP addresses through the auspices of the widely used Domain Name Service (DNS). I therefore decided to build support for TCP/IP directly into the SETL I/O system, such that opening and using a bidirectional TCP communications stream in SETL would be as easy and natural as opening and using a file. How this is done is detailed in Chapter 3, which also describes SETL’s programmer-friendly support for UDP datagrams.

The ability to code servers in SETL has proven to be even more useful than I predicted. Servers act as the primary objects in server hierarchies. They bear state and control access to that state through message-passing protocols with child processes which in turn deal with clients.
A server tends to keep track of such children with a dynamically varying map. Between the server and these trusted, proximal children, communication is safe and quick, minimizing the risk of the server becoming a bottleneck. The WEBeye study of Chapter 4 illustrates this pattern.

The liberal use of processes turns out to be beneficial time and again. Real work tends to fall to simple modules which communicate in a primitive way through their standard input and output channels, and these modules can easily be written in more efficiency-oriented languages than SETL where necessary. Small components are also easy to isolate for special or unusual testing, or for those rare but inevitable episodes called debugging.

Overall, systems designed as process-intensive server hierarchies tend to acquire a satisfying dataflow feel. Not only is this in the spirit of Unix filters, it also dovetails with SETL’s value semantics, which abhor a pointer and cherish a copy, and in so doing avoid the hazard of distributed dangling references.

Contents

Dedication
Acknowledgments
Preface

1 Introduction
  1.1 Why SETL?
  1.2 A Brief History of SETL
  1.3 Summary

2 Environmentally Friendly I/O
  2.1 Invocation Environment
  2.2 Filters, Pipes, and Pumps
    2.2.1 Filters
    2.2.2 Pipes
    2.2.3 Pumps
    2.2.4 Buffering
    2.2.5 Line-Pumps
  2.3 Sequential and Direct-Access I/O
    2.3.1 Sequential Reading and Writing
    2.3.2 String I/O
    2.3.3 Direct-Access Files
  2.4 Signals and Timers
    2.4.1 Signals
    2.4.2 Timers
  2.5 Multiplexing with Select
  2.6 Files, Links, and Directories
  2.7 User and Group Identities
  2.8 Processes and Process Groups
  2.9 Open Compatibility
  2.10 Automatically Opened Files
  2.11 Opening Streams Over File Descriptors
  2.12 Passing File Descriptors
  2.13 Normal and Abnormal Endings
  2.14 Strings
    2.14.1 Matching by Regular Expression
    2.14.2 Formatting and Extracting Values
    2.14.3 Printable Strings
    2.14.4 Case Conversions and Character Encodings
    2.14.5 Concatenation and a Note on Defaults
  2.15 Field Selection Syntax for Maps
  2.16 Time
  2.17 Low-Level System Interface
    2.17.1 I/O and File Descriptors
    2.17.2 Processes
  2.18 Summary

3 Internet Sockets
  3.1 Clients and Servers
    3.1.1 A Client
    3.1.2 A Server
    3.1.3 Choosing the Port Number
  3.2 Concurrent Servers
    3.2.1 Naïve Server
    3.2.2 Shell-Aided Server
    3.2.3 Shell-Independent Server
    3.2.4 Pump-Aided Server
  3.3 Defensive Servers
    3.3.1 Time-Monitoring Server
    3.3.2 Identity-Sensitive Server
  3.4 UDP Sockets
  3.5 Summary

4