Holistic Analysis for Refactoring Python-Based Analytics Programs
HARP: Holistic Analysis for Refactoring Python-Based Analytics Programs

Weijie Zhou ([email protected]), North Carolina State University, Raleigh, NC
Yue Zhao ([email protected]), Facebook, Menlo Park, CA
Guoqiang Zhang ([email protected]), North Carolina State University, Raleigh, NC
Xipeng Shen ([email protected]), North Carolina State University, Raleigh, NC

ABSTRACT
Modern machine learning programs are often written in Python, with the main computations specified through calls to some highly optimized libraries (e.g., TensorFlow, PyTorch). How to maximize the computing efficiency of such programs is essential for many application domains and has drawn much recent attention. This work points out a common limitation in existing efforts: they focus only on the static computation graphs specified by library APIs, but leave the influence of the hosting Python code largely unconsidered. The limitation often causes them to miss the big picture and hence many important optimization opportunities. This work proposes a new approach named HARP to address the problem. HARP enables holistic analysis that spans across computation graphs and their hosting Python code. HARP achieves it through a set of novel techniques: analytics-conscious speculative analysis to circumvent Python complexities, a unified representation (augmented computation graphs) to capture all dimensions of knowledge related to the holistic analysis, and a conditioned-feedback mechanism to allow risk-controlled aggressive analysis. Refactoring based on HARP gives 1.3–3X speedups (2.07X on average) on a set of TensorFlow and PyTorch programs.

CCS CONCEPTS
• Software and its engineering → Automated static analysis; Data flow architectures; • Computing methodologies → Machine learning.

KEYWORDS
machine learning program, computation graph, dynamic language, program analysis

ACM Reference Format:
Weijie Zhou, Yue Zhao, Guoqiang Zhang, and Xipeng Shen. 2020. HARP: Holistic Analysis for Refactoring Python-Based Analytics Programs. In 42nd International Conference on Software Engineering (ICSE '20), May 23–29, 2020, Seoul, Republic of Korea. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3377811.3380434

© 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7121-6/20/05. https://doi.org/10.1145/3377811.3380434

1 INTRODUCTION
For machine learning applications, computing efficiency is essential, given the growing scale of datasets and the need for timely responses in various applications. A pattern common in today's analytics applications is Python+X, where the main computations are written with the APIs of some library X (e.g., TensorFlow, PyTorch) while the host code that connects all the pieces together is written in Python. We call such programs Python-based analytics programs.

Such programs have some common features. We draw on TensorFlow [1] as an example for explanation. Like some other packages, TensorFlow was initially introduced for a specific scope (deep learning) but then evolved for a broader scope (analytics and scientific computing). At its core, TensorFlow builds on the dataflow programming model. In developing a TensorFlow program, a developer writes code that calls TensorFlow APIs to specify the intended computations, and then makes a call into the TensorFlow runtime. The call triggers the execution of those APIs, which constructs a computation graph; the TensorFlow runtime optimizes the graph and then executes it. This approach represents one of the popular programming paradigms used by modern machine learning and analytics frameworks.

Recent years have seen a number of efforts to enhance the performance of Python-based analytics programs. For example, the XLA project [18] and the R-Stream-TF project [31] try to improve the runtime performance of TensorFlow by utilizing compiler techniques to transform dataflow graphs (e.g., fusing multiple operations).

However, despite the many efforts, a large room for performance improvement remains elusive for current optimizers to harvest. We believe that one of the fundamental reasons is that current efforts have all focused on the dataflow graphs embodied by the invocations of library APIs in a program. Although dataflow graphs usually capture the core computations of an analytics program, limiting the view to them may lose the big picture that the host code provides, and hence many large-scoped optimization opportunities.

Listing 1 offers an example. This codelet is part of the core computations in Deep Dictionary Learning [24]. The first five lines of the code specify the core computations and hence the structure of the dataflow graph.
Listing 1: Example from Deep Dictionary Learning ("tf" is the namespace of TensorFlow)

1 A = tf.Variable(tf.zeros(shape=[N, N]), dtype=tf.float32)
2 D = tf.placeholder(shape=[N, N], dtype=tf.float32)
3 X = tf.placeholder(shape=[N, N], dtype=tf.float32)
4 R = tf.matmul(D, tf.subtract(X, tf.matmul(tf.transpose(D), A)))
5 L = tf.assign(A, R)
6 for i in range(Iter):
7     result = sess.run(L, feed_dict={D: D_, X: X_})

Line 1 defines A as a variable, as its value needs to be updated through the execution; Lines 2 and 3 define D and X as placeholders, for they will hold the input values; Line 4 specifies the formula for calculating the new value of A through a series of calls to high-level TensorFlow mathematics functions; Line 5 specifies an update to A with the new value. Line 6 is a Python loop statement, and Line 7 calls the TensorFlow API "sess.run" (a blocking API) to invoke the TensorFlow runtime to actually construct the dataflow graph and execute it; it passes the actual parameters D_ and X_ to the two placeholders. The following pseudo-code shows what the codelet implements:

    for i < Iter:
        A = D (X − D^T A)

where each of the capitalized letters represents a multidimensional array (called a tensor), and D and X are input tensors that remain constant across the loop iterations.

Listing 2: Listing 1 in a Different Form

t1 = tf.matmul(D, X)
t2 = tf.matmul(D, tf.transpose(D))
R = tf.subtract(t1, tf.matmul(t2, A))
L = tf.assign(A, R)
for i in range(Iter):
    result = sess.run(L, feed_dict={D: D_, X: X_})

The implementation by the codelet in Listing 1 contains some redundant computations, which can be seen if we expand the mathematical formula in the pseudo-code to DX − DD^T A.
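Since the update A = D(X − D^T A) expands to DX − DD^T A, and D and X do not change across iterations, the two products DX and DD^T can be computed once outside the loop. The sketch below checks this equivalence numerically; it uses NumPy rather than TensorFlow, and the matrix size and iteration count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, Iter = 8, 5
D = rng.standard_normal((N, N))
X = rng.standard_normal((N, N))

# Original form: three matrix products inside the loop.
A1 = np.zeros((N, N))
for _ in range(Iter):
    A1 = D @ (X - D.T @ A1)

# Hoisted form: the loop-invariant products D@X and D@D.T
# are computed once; only one product remains per iteration.
C1 = D @ X
C2 = D @ D.T
A2 = np.zeros((N, N))
for _ in range(Iter):
    A2 = C1 - C2 @ A2

assert np.allclose(A1, A2)  # both forms compute the same result
```

The hoisted form replaces 3·Iter matrix multiplications with Iter + 2, which is the source of the speedups reported for this example.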
Because the terms DX and DD^T are invariant across the loop iterations, they can be hoisted outside the loop, computed once, and reused for all iterations, as follows:

    C1 = DX
    C2 = DD^T
    for i < Iter:
        A = C1 − C2 A

Even though the redundant computations are not hard to see at the pseudo-code level, they cannot be recognized or optimized by either TensorFlow or any of the previously proposed optimizers designed for TensorFlow programs. Moreover, even if we rewrite the code to better expose the redundancy, as in Listing 2, prior optimizers still cannot detect the redundant computations. The reason is that all these tools focus only on the dataflow graph composed through the TensorFlow APIs, while the redundancy comes from the interplay between those TensorFlow API calls and the other Python code that hosts those calls. Take Listing 1 for instance: without analyzing the host Python code, it is impossible to tell that D and X are invariant across the for loop, and hence that DX and DD^T are also loop invariants, as the computation graph does not capture the loop in the host code or its relation with the data in the graph. On the other hand, general Python code optimizers cannot find the redundancy either, as they do not understand the necessary semantics of the TensorFlow APIs.

Python is a dynamically-typed language, the complexities of which (listed later) form major barriers for static optimizers to work effectively. A pure dynamic optimizer, on the other hand, is complicated to develop and causes runtime overhead and delays. The strategy we finally took is to create a refactoring assistant, which provides suggestions for developers to use in refactoring the code. This strategy allows aggressive analysis on incomplete information. Even though the suggestions may not always be correct, because the tool provides enough feedback on the assumptions it uses, developers can avoid the risks while taking advantage of the often-correct suggestions. HARP is the first tool that enables holistic analysis spanning dataflow graphs and their hosting Python code. HARP achieves it through several novel features:

(1) Speculative analysis. HARP relies heavily on static analysis. As a dynamically-typed scripting language, Python poses many difficulties to static analysis, such as unknown types, operator overloading, dynamic dispatch of higher-order functions, and so on. HARP circumvents the complexities with two key insights: many Python complexities rarely appear in analytics programs or can be ignored; for those that do matter, they can often be treated successfully through speculation on common patterns in analytics applications. HARP materializes the insights through speculative analysis.
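To give a flavor of the kind of host-code analysis this motivates, the following sketch uses Python's ast module to check, speculatively, that a variable is never reassigned inside a loop body and is therefore a candidate loop invariant. This is not HARP's actual algorithm; the function names are invented, and the check deliberately ignores aliasing, attribute writes, and dynamic features, in the spirit of the speculative treatment described above:

```python
import ast

def assigned_names(loop_body):
    """Collect names that appear as assignment targets in a loop body."""
    names = set()
    for stmt in loop_body:
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                names.add(node.id)
    return names

def loop_invariant_candidates(src, variables):
    """Speculatively report which of `variables` are never reassigned
    inside any for-loop of `src` (ignoring aliasing: a simplification)."""
    tree = ast.parse(src)
    invariant = set(variables)
    for node in ast.walk(tree):
        if isinstance(node, ast.For):
            invariant -= assigned_names(node.body)
    return invariant

code = """
for i in range(Iter):
    result = sess.run(L, feed_dict={D: D_, X: X_})
"""
# 'result' is assigned in the loop body; D_ and X_ are not,
# so they are candidates for invariance across iterations.
print(loop_invariant_candidates(code, {"D_", "X_", "result"}))
```

Such a host-side fact, combined with knowledge of the library's operator semantics (e.g., that tf.matmul is a pure function of its inputs), is what allows the DX and DD^T products in the example to be identified as hoistable.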