<<

Efficiency Considerations of PERL and Python in Distributed Processing

Roger Eggen (presenter) Maurice Eggen and Information Sciences Department of Computer Science University of North Florida Trinity University Jacksonville, FL 32224 San Antonio, TX 78212 [email protected] [email protected] 904.620.1326 210.999.7487 904.620.2988 (fax) 210.999.7477 (fax)

Abstract: With the resurgence of interest in distributed processing and programming of distributed systems, researchers are exploring methods and techniques for facilitating programming of such systems. In this paper we demonstrate the ease of use and performance characteristics of socket communication for scripting languages PERL and Python. Productivity is enhanced when developing distributed applications while considering the performance tradeoffs incurred as compared to other popular languages.

Keywords: PERL, Python, , Distributed and Parallel Processing

1 Introduction to achieve the solution of a problem. The are considered loosely Distributed and parallel processing coupled if they do not share memory and belong to one of the most important have individual processors. Important areas of research in our modern programming strides have been made in computing world. From the Web to large recent years to provide applications scale numerical algorithms, distributed programming environments that and parallel methods dominate distributed computing more common distributed processing applications. It is and easier to implement than in the past. important for researchers and software The message passing interface (MPI), engineers to have a wealth of tools parallel virtual machine (PVM), available to them for programming such Common Object Request Broker distributed systems. In this paper, we Architecture (CORBA), Java’s Remote advocate the use of easy parallel Method Invocation (RMI), PERL’s algorithm development languages such SOAP interface, and others have as PERL and Python. We compare the extended ordinary languages to provide speed and efficiency of these vehicles methods (functions, procedures) with Java, is widely used for these facilitating distributed computing. PERL applications. and Python are not often used in distributed applications, but these 2 Fundamentals languages are extremely easy to program and have features that support parallel A distributed system is a collection of programming and Web interfaces independent computers interconnected rivaling any available language. by a network and capable of cooperating In this paper, we study the basis of all • Generic algorithms Web and distributed communication • Graphing and charting over a network, that of sockets. We • Image processing compare the socket communication of • Mathematical and statistical PERL and Python to Java for ease of programming development and overall efficiency. • Natural language processing (in English, Chinese, Japanese, and 3 PERL Finnish, among others) • Network is an open source, cross-platform • Operating-system integration capable of with Windows, Solaris, , performing a wide variety of Mac OS, etc. applications. The word PERL is an • PERL/Apache integration acronym for “Practical Extraction and • Spam identification Report Language.” Used in both the • Software testing public and private sectors, PERL is • Templating systems supported by , Macintosh, • Prototyping for fast development Windows and many more operating • Text processing systems. PERL comprises the best • Web services, Web clients, and features of many languages, including , Web servers , , sh and Basic. PERL works • XML/HTML processing [1] with HTML, XML, and other markup languages to support and both The following code performs the socket procedural and object oriented communication in Perl: paradigms. PERL is extensible. Because my $host2 = 'node2' of its excellent text processing my $port2 = 2008; capabilities, PERL is perhaps the most my $protocol2 = widely used Web programming getprotobyname('tcp'); socket(SOCK2,AF_INET,SOCK_STREAM, language. Applications written using the $protocol1) PERL language include, but are not my $dest_addr2 = limited to the following: sockaddr_in($port2,$host2); connect(SOCK2,$dest_addr2) print SOCK2 $x; • Application servers • Artificial intelligence algorithms 4. Python • Astronomy • Audio Python is an easy to learn, powerful • programming language. Efficient high- • Compression and encryption level data structures lead to a simple, but • Content management systems (for both small and large scale effective approach to object-oriented Web sites) programming. Python's elegant syntax and dynamic typing together with its • Database interfaces interpreted nature make it an ideal • Date/Time Processing language for scripting and rapid • eCommerce application development in many areas • Email processing and on most platforms. • GUI development The Python and the extensive 6 Software standard library are freely available in source or binary for all major The software consists of Python 2.2.2, platforms from the Python Web site. [2] RedHat 9 Linux core 3.2.2-4, PERL The same site also contains distributions 5.8.3 installed with threading, and Java of and pointers to many free third party HotSpot 1.4.1_01. Only Linux machines Python modules, programs and tools, were involved to remove discrepancies and additional documentation. Python is caused by different operating systems. object oriented with a class structure similar to Java. In many respects, Python 7 Programming a Distributed has the best features of Java and avoids much of the , particularly in the System input/output category. Python supports , polymorphism, and Programming a distributed system, all the other features expected of an regardless of the programming object oriented language. environment used, involves several fundamentals. Typically, one of three The Python interpreter is easily extended scenarios are used: boss/worker where with new functions and data types the boss distributes a portion of the work implemented in C or C++ (or other uniquely, worker crew where each languages callable from C). Python is processor does essentially the same also suitable as an extension language processing on different portions of the for customizable applications. [2] data, often referred to as single instruction multiple data (SIMD), and a The following code performs the socket pipeline where each worker processes a communication in Python: portion of the data, passing that partial solution to the next worker. In this HOST = 'node2' # The remote host research we use the boss/worker PORT = 5002 # port for server environment as shown in Figure 1. A s = socket.socket(socket.AF_INET,socket. boss process defines the task to be SOCK_STREAM) accomplished and decides the method of s.connect((HOST, PORT)) distribution. Several worker processes st = a.tostring() s.send(st) do the actual work. The worker starts by listening to the socket for the arrival of Again, the communication using Python work from the boss. The boss is the is easy and direct. controlling process that has the responsibility of dividing the task 5 Hardware appropriately. A portion of the task is sent to each of the workers. The boss

waits to receive the completed tasks. The hardware for evaluating socket Each of the workers execute in parallel. efficiency consists of a Beowulf cluster They each receive their task from the of computers all running RedHat linux boss without interaction with other v9. The machines are 0.5 gHz machines workers. with 512 megabytes of main memory connected by gigabit fast ethernet.

results. Timings included dividing and Worker Worker ... Worker distributing the data, sorting each data , receiving the sorted results from each worker, and merging to arrive at a totally sorted set of data.

Boss 9 Results

Figure 1 Datasets of sizes 1000, 2000, 4000, Communication Dependency 8000, 16000, 32000, 64000, 128000, and 256000, were sorted and timed. Since 8 Evaluation the machines used in this experiment were not dedicated to the researchers,

times were taken at 3:00AM when The purpose of this research is to network traffic and users were at a determine how much a user gains (or minimum. These machines are lightly gives up) to enjoy the ease of used, so unplanned impact on programming languages PERL and communication is virtually nonexistent. Python in distributed applications. We The authors have carefully checked the chose a simple . Sorting algorithms, each is correct and is well understood and ubiquitous in implemented in a consistent manner. The computing, so it seemed a reasonable timing results are in Figure 2. place to start. We programmed an n2 sort so reasonable time would be taken on We note that the efficiency of PERL is each of the server machines, creating a significantly less than that of Java. A measure of both communication and variety of considerations were employed computation times. Boss and workers are to increase PERL’s efficiency, each with programmed using Java, PERL, and limited success. PERL 5.8.0 ran 90% as Python. Each application is implemented fast as PERL 5.8.3, so one should with exactly the same functionality. The certainly ensure the latest version of boss machine creates sets of integers PERL is used. Also, threading is whose cardinality vary between 1,000 relatively new to PERL, only coming and 256,000. Each set was then evenly into existence in the 5.8.x versions. divided and distributed among the Threaded PERL 5.8.3 runs 96% as fast worker machines. Each worker sorted as a non-threaded installation. In a their portion of the data, yielding a distributed application, installation of a highly parallel solution. The threaded version of PERL is essential. communication from boss to worker to boss was handled through native sockets. The PERL application can be compiled Sockets provide the foundation of all to native code using perlcc. Perlcc is in a Web applications and consequently “nonstable” state and should not be used impact the performance of any Web for production applications, but it did based process. generate native code. However, the

native code executed only 92% as fast as As is often the case, the boss did not the code directly interpreted. We participate in the primary task, but rather distributed the work and waited for a significant increase in performance qualitative, we believe our experience is when perlcc is enhanced. suggestive of the nature one can expect when considering these environments. The authors were surprised to see the Upon evaluation of our respective performance comparison between experiences in each environment, we Python, PERL, and Java. We expected believe time was most Python to perform much better when efficient in Python, Perl a close second, compared to the other languages. and development of the Java application approximately 30% longer. We realize A less defined measure is programmer this consideration is highly arbitrary, but productivity. We carefully monitored believe programmer productivity is a our development time for each of the significant factor in cost of application applications. While this is somewhat development. arbitrary, ill defined, and rather

Processors 2 4 8

Data Size PERL Python Java PERL Python Java PERL Python Java 1000 0.943 0.914 0.247 0.59 0.341 0.142 0.579 0.087 0.212 2000 2.271 3.616 0.284 1.155 1.308 0.199 0.719 0.331 0.257 4000 8.09 21.174 0.367 3.357 5.261 0.23 1.276 1.312 0.268 8000 31.046 84.569 0.546 12.058 21.174 0.236 3.49 5.291 0.289 16000 122.327 338.504 1.474 46.338 84.735 0.488 12.243 21.305 0.338 32000 490.912 1353.233 4.931 178.554 339.36 1.411 46.958 85.446 0.553 64000 1983.205 5426.464 17.973 682.562 1354.352 4.837 180.644 339.825 1.468 128000 7984.755 21684.42 70 2553.147 5414.646 17.68 693.55 1362.639 4.921 256000 31653.09 87170.28 308.03 9330.001 21712.73 68.136 2568.75 5456.379 17.923

Figure 2 Efficiency Timings

10 Summary and Conclusions Java’s socket communication is If the fastest application processing efficient, which accounts for some of its possible is the goal, PERL, Python, or performance. Also, interpreting byte Java would not be the . PVM or code is more efficient than pure MPI with native routines will interpretation. Note, each application outperform all the above applications. scales appropriately with the size of the However, when developer productivity, data set. ease of programming, portability, and prototyping are desired, PERL and 11 Future Research Python are viable alternatives. PERL and Python support a robust and convenient In JavaSpace’s Linda-like environment, API for implementing distributed and the efficiency in a non-reliable Web based applications. There are many networking environment should be Web based applications programmed in studied to determine the impact of a PERL, since it is easy to implement. loaded network or failed processes. Part of the advantages of JavaSpaces is its ability to recover from failure of processes, processors, or the network.

PERL provides a SOAP-Lite Web based environment. Ease of programming and efficiency of applications development should be considered.

12 Bibliography

[1] http://www.PERL.org/

[2] http://www.python.org

[3] Wall, L., Christiansen, T., and Orwant, J. Programming Perl. O’REILLY (2000).

[4] Liu, M. Distributed Computing. Addison Wesley. Boston (2004).

[5] Stein, L. Network Programming with Perl. Addison Wesley. Boston (2001).

[6] Christiansen, T. and Torkington, N. Perl Cookbook. O’REILLY. Sebastopol, CA.(1998).

[7] Schwartz, . Speeding up Your Perl Programs. Sys Admin CMP Media LLC. November 2003, pp. 43-44.