
DimmWitted: A Study of Main-Memory Statistical Analytics

Ce Zhang†‡    Christopher Ré†
†Stanford University    ‡University of Wisconsin-Madison
{czhang, chrismre}@cs.stanford.edu

ABSTRACT

We perform the first study of the tradeoff space of access methods and replication to support statistical analytics using first-order methods executed in the main memory of a Non-Uniform Memory Access (NUMA) machine. Statistical analytics systems differ from conventional SQL-analytics in the amount and types of memory incoherence that they can tolerate. Our goal is to understand tradeoffs in accessing the data in row- or column-order and at what granularity one should share the model and data for a statistical task. We study this new tradeoff space and discover that there are tradeoffs between hardware and statistical efficiency. We argue that our tradeoff study may provide valuable information for designers of analytics engines: for each system we consider, our prototype engine can run at least one popular task at least 100× faster. We conduct our study across five architectures using popular models, including SVMs, logistic regression, Gibbs sampling, and neural networks.

1. INTRODUCTION

Statistical data analytics is one of the hottest topics in data-management research and practice. Today, even small organizations have access to machines with large main memories (via Amazon's EC2) or for purchase at $5/GB. As a result, there has been a flurry of activity to support main-memory analytics in both industry (Google Brain, Impala, and Pivotal) and research (GraphLab, and MLlib). Each of these systems picks one design point in a larger tradeoff space. The goal of this paper is to define and explore this space. We find that today's research and industrial systems under-utilize modern commodity hardware for analytics—sometimes by two orders of magnitude. We hope that our study identifies some useful design points for the next generation of such main-memory analytics systems.

Throughout, we use the term statistical analytics to refer to those tasks that can be solved by first-order methods—a class of iterative algorithms that use gradient information; these methods are the core algorithm in systems such as MLlib, GraphLab, and Google Brain. Our study examines analytics on commodity multi-socket, multi-core, non-uniform memory access (NUMA) machines, which are the de facto standard machine configuration and thus a natural target for an in-depth study. Moreover, our experience with several enterprise companies suggests that, after appropriate preprocessing, a large class of enterprise analytics problems fits into the main memory of a single, modern machine. While this architecture has been recently studied for traditional SQL-analytics systems [9], it has not been studied for statistical analytics systems.

Statistical analytics systems are different from traditional SQL-analytics systems. In comparison to traditional SQL-analytics, the underlying methods are intrinsically robust to error. On the other hand, traditional statistical theory does not consider which operations can be efficiently executed. This leads to a fundamental tradeoff between statistical efficiency (how many steps are needed until convergence to a given tolerance) and hardware efficiency (how efficiently those steps can be carried out).

To describe such tradeoffs more precisely, we describe the setup of the analytics tasks that we consider in this paper. The input data is a matrix in R^{N×d}, and the goal is to find a vector x ∈ R^d that minimizes some (convex) loss function, say the logistic loss or the hinge loss (SVM). Typically, one makes several complete passes over the data while updating the model; we call each such pass an epoch. There may be some communication at the end of the epoch, e.g., in bulk-synchronous parallel systems such as Spark. We identify three tradeoffs that have not been explored in the literature: (1) access methods for the data, (2) model replication, and (3) data replication. Current systems have picked one point in this space; we explain each space and discover points that have not been previously considered. Using these new points, we can perform 100× faster than previously explored points in the tradeoff space for several popular tasks.

Access Methods. Analytics systems access (and store) data in either row-major or column-major order. For example, systems that use stochastic gradient descent methods (SGD) access the data row-wise; examples include MADlib [13] in Impala and Pivotal, Google Brain [18], and MLlib in Spark [33]. Stochastic coordinate descent methods (SCD) access the data column-wise; examples include GraphLab [21], Shogun [32], and Thetis [34]. These methods have essentially identical statistical efficiency, but their wall-clock performance can be radically different due to hardware efficiency.
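Concretely, the optimization setup described above can be written as follows. These are our own formulas, consistent with the description in this section but not reproduced from the paper; in particular, the labels b_i and the unregularized form are illustrative assumptions.

\[
\min_{x \in \mathbb{R}^d} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big(b_i,\, a_i^{\top} x\big),
\qquad
\ell_{\mathrm{logistic}}(b, t) = \log\big(1 + e^{-b t}\big),
\qquad
\ell_{\mathrm{hinge}}(b, t) = \max(0,\, 1 - b t),
\]

where \(a_i \in \mathbb{R}^d\) is the \(i\)-th row of the data matrix \(A\) and \(b_i \in \{-1, +1\}\) is its label. One epoch of SGD with step size \(\alpha\) visits every row once and applies

\[
x \leftarrow x - \alpha \, \nabla_x\, \ell\big(b_i,\, a_i^{\top} x\big).
\]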

[Figure 1 shows four panels: (a) the Memory Model for DimmWitted (data A of size N×d and model x of size d); (b) Pseudocode of SGD; (c) the three access methods (row-wise, column-wise, column-to-row); and (d) a NUMA architecture, a machine with two nodes, each with its own RAM, L3 cache, and L1/L2 caches per core.]

Pseudocode of SGD from Figure 1(b):

    Input: Data matrix A, stepsize α, gradient function grad
    Output: Model m
    Procedure:
      m <- initial model
      while not converged:
        foreach row z in A:           // read set: the row z; one pass = one epoch
          m <- m - α * grad(z, m)     // write set: the model m
        test convergence

Figure 1: Illustration of (a) DimmWitted's Memory Model, (b) Pseudocode for SGD, (c) Different Statistical Methods in DimmWitted and Their Access Patterns, and (d) NUMA Architecture.
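The pseudocode in Figure 1(b) translates directly into NumPy. The sketch below is a minimal, runnable version of that loop under two assumptions of ours, a logistic-loss gradient and a fixed step size; DimmWitted's actual engine is a C++ storage manager, not this Python loop.

    import numpy as np

    def logistic_grad(z, label, m):
        """Gradient of log(1 + exp(-label * <z, m>)) with respect to m."""
        margin = label * z.dot(m)
        return -label * z / (1.0 + np.exp(margin))

    def sgd(A, labels, stepsize=0.01, epochs=20, tol=1e-4):
        """Row-wise SGD as in Figure 1(b): one pass over the rows of A per epoch."""
        n_rows, d = A.shape
        m = np.zeros(d)                          # m <- initial model
        for _ in range(epochs):                  # while not converged
            for i in np.random.permutation(n_rows):
                z, label = A[i], labels[i]
                m -= stepsize * logistic_grad(z, label, m)   # m <- m - alpha * grad(z, m)
            # test convergence: estimate the norm of the full gradient
            full_grad = sum(logistic_grad(A[i], labels[i], m)
                            for i in range(n_rows)) / n_rows
            if np.linalg.norm(full_grad) < tol:
                break
        return m

    # Usage: A = np.random.randn(1000, 10); labels = np.sign(A @ np.random.randn(10))
    # model = sgd(A, labels)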

However, this access-method tradeoff has not been systematically studied. To study it, we introduce a storage abstraction that captures the access patterns of popular statistical analytics tasks and a prototype called DimmWitted. In particular, we identify three access methods that are used in popular analytics tasks, including standard supervised models such as SVMs, logistic regression, and least squares, and more advanced methods such as neural networks and Gibbs sampling on factor graphs. For different access methods for the same problem, we find that the time to converge to a given loss can differ by up to 100×.

We also find that no access method dominates all others, so an engine designer may want to include both access methods. To show that it may be possible to support both methods in a single engine, we develop a simple cost model to choose among these access methods. This cost model selects a nearly optimal point across our data sets, models, and different machine configurations.

Data and Model Replication. We study two sets of tradeoffs: the level of granularity, and the mechanism by which mutable state and immutable data are shared in analytics tasks. We describe the tradeoffs we explore in both (1) mutable state sharing, which we informally call model replication, and (2) data replication.

(1) Model Replication. During execution, there is some state that the task mutates (typically an update to the model). We call this state, which may be shared among one or more processors, a model replica. We consider three different granularities at which to share model replicas:

• The PerCore approach treats a NUMA machine as a distributed system in which every core is treated as an individual machine, e.g., in bulk-synchronous models such as MLlib on Spark or event-driven systems such as GraphLab. These approaches are the classical shared-nothing and event-driven architectures, respectively. In PerCore, the part of the model that is updated by each core is only visible to that core until the end of an epoch. This method is efficient and scalable from a hardware perspective, but it is less statistically efficient, as there is only coarse-grained communication between cores.

• The PerMachine approach acts as if each processor has uniform access to memory. This approach is taken in Hogwild! and Google Downpour [10]. In this method, the hardware takes care of the coherence of the shared state. The PerMachine method is statistically efficient due to high communication rates, but it may cause contention in the hardware, which may lead to suboptimal running times.

• A natural hybrid is PerNode; this method uses the fact that PerCore communication through the last-level cache (LLC) is dramatically faster than communication through remote main memory. This method is novel; for some models, PerNode can be an order of magnitude faster.

Because model replicas are mutable, a key question is how often we should synchronize model replicas. We find that it is beneficial to synchronize the models as much as possible—so long as we do not impede throughput to data in main memory. A natural idea, then, is to use PerMachine sharing, in which the hardware is responsible for synchronizing the replicas. However, this decision can be suboptimal, as the cache-coherence protocol may stall a processor to preserve coherence, and this information may not be worth the cost of a stall from a statistical efficiency perspective. We find that the PerNode method, coupled with a simple technique to batch writes across sockets, can dramatically reduce communication and processor stalls. The PerNode method can result in an over 10× runtime improvement. This technique depends on the fact that we do not need to maintain the model consistently: we are effectively delaying some updates to reduce the total number of updates across sockets (which lead to processor stalls).

(2) Data Replication. The data for analytics is immutable, so there are no synchronization issues for data replication. The classical approach is to partition the data to take advantage of higher aggregate memory bandwidth. However, each partition may contain skewed data, which may slow convergence. Thus, an alternate approach is to replicate the data fully (say, per NUMA node). In this approach, each node accesses that node's data in a different order, which means that the replicas provide non-redundant statistical information; in turn, this reduces the variance of the estimates based on the data in each replicate. We find that, for some tasks, fully replicating the data four ways can converge to the same loss almost 4× faster than the sharding strategy.

Summary of Contributions. We are the first to study the three tradeoffs listed above for main-memory statistical analytics systems. These tradeoffs are not intended to be an exhaustive set of optimizations, but they demonstrate our main conceptual point: treating NUMA machines as distributed systems or SMP is suboptimal for statistical analytics. We design a storage manager, DimmWitted, that shows it is possible to exploit these ideas on real data sets. Finally, we evaluate our techniques on multiple real datasets, models, and architectures.

2. BACKGROUND

In this section, we describe the memory model for DimmWitted, which provides a unified memory model to implement popular analytics methods. Then, we recall some basic properties of modern NUMA architectures.

Data for Analytics. The data for an analytics task is a pair (A, x), which we call the data and the model, respectively. For concreteness, we consider a matrix A ∈ R^{N×d}. In machine learning parlance, each row is called an example. Thus, N is often the number of examples and d is often called the dimension of the model. There is also a model, typically a vector x ∈ R^d. The distinction is that the data A is read-only, while the model vector, x, will be updated during execution. From the perspective of this paper, the important distinction we make is that the data is an immutable matrix, while the model (or portions of it) is mutable data.

First-Order Methods for Analytic Algorithms. DimmWitted considers a class of popular algorithms called first-order methods. Such algorithms make several passes over the data; we refer to each such pass as an epoch. A popular example algorithm is stochastic gradient descent (SGD), which is widely used by web companies, e.g., Google Brain [18] and Vowpal Wabbit [1], and in enterprise systems such as Pivotal, Oracle, and Impala. Pseudocode for this method is shown in Figure 1(b). During each epoch, SGD reads a single example z; it uses the current value of the model and z to estimate the derivative; and it then updates the model vector with this estimate. It reads each example in this loop. After each epoch, these methods test convergence (usually by computing or estimating the norm of the gradient); this computation requires a scan over the complete dataset.

2.1 Memory Models for Analytics

We design DimmWitted's memory model to capture the trend in recent high-performance sampling and statistical methods. There are two aspects to this memory model: the coherence level and the storage layout.

Coherence Level. Classically, memory systems are coherent: reads and writes are executed atomically. For analytics systems, we say that a memory model is coherent if reads and writes of the entire model vector are atomic. That is, access to the model is enforced by a critical section. However, many modern analytics algorithms are designed for an incoherent memory model. The Hogwild! method showed that one can run such a method in parallel without locking but still provably converge. The Hogwild! memory model relies on the fact that writes of individual components are atomic, but it does not require that the entire vector be updated atomically; atomicity at the level of the cacheline is provided by essentially all modern processors. Empirically, these results allow one to forgo costly locking (and coherence) protocols. Similar algorithms have been proposed for other popular methods, including Gibbs sampling [15, 31], stochastic coordinate descent (SCD) [29, 32], and linear systems solvers [34]. This technique was applied by Dean et al. [10] to solve convex optimization problems with billions of elements in a model. This memory model is distinct from the classical, fully coherent database execution.

The DimmWitted prototype allows us to specify whether a region of memory is coherent or not. This region of memory may be shared by one or more processors. If the memory is only shared per thread, then we can simulate a shared-nothing execution. If the memory is shared per machine, we can simulate Hogwild!.

Access Methods. We identify three distinct access paths used by modern analytics systems, which we call row-wise, column-wise, and column-to-row. They are graphically illustrated in Figure 1(c). Our prototype supports all three access methods. All of our methods perform several epochs, that is, passes over the data. However, the algorithm may iterate over the data row-wise or column-wise.

• In row-wise access, the system scans each row of the table, applies a function to it, and then updates the model. This method may write to all components of the model. Popular methods that use this access method include stochastic gradient descent, gradient descent, and higher-order methods (such as l-BFGS).

• In column-wise access, the system scans each column j of the table. This method reads just the j-th component of the model. The write set of the method is typically a single component of the model. This method is used by stochastic coordinate descent.

• In column-to-row access, the system iterates conceptually over the columns. This method is typically applied to sparse matrices. When iterating on column j, it will read all rows in which column j is non-zero. This method also updates a single component of the model. This method is used by non-linear support vector machines in GraphLab and is the de facto approach for Gibbs sampling.

DimmWitted is free to iterate over rows or columns in essentially any order (although typically some randomness in the ordering is desired). Figure 2 classifies popular implementations by their access method.

Figure 2: Algorithms and Their Access Methods.

    Algorithm                       Access Method(s)             Implementation
    Stochastic Gradient Descent     Row-wise                     MADlib, Spark, Hogwild!
    Stochastic Coordinate Descent   Column-wise, Column-to-row   GraphLab, Shogun, Thetis
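To make the three access paths concrete, the sketch below implements a column-wise epoch (exact coordinate descent for least squares) and the column-to-row fetch on a SciPy sparse matrix. The least-squares objective, the residual bookkeeping, and the CSC/CSR layout are our own illustrative choices, not DimmWitted's storage manager; row-wise access is shown in the earlier SGD sketch.

    import numpy as np
    import scipy.sparse as sp

    def scd_least_squares_epoch(A_csc, y, x, resid):
        """Column-wise access: exact coordinate descent for min ||Ax - y||^2.
        Visiting column j reads that column and writes only x[j]."""
        n_rows, d = A_csc.shape
        for j in np.random.permutation(d):
            col = A_csc.getcol(j)                        # scan column j
            denom = (col.T @ col).toarray().item()
            if denom == 0.0:
                continue
            delta = (col.T @ resid).item() / denom
            x[j] += delta                                # write set: one model component
            resid -= delta * col.toarray().ravel()
        return x, resid

    def column_to_row_neighbors(A_csc, A_csr, j):
        """Column-to-row access: when visiting column j, fetch every row in which
        column j is non-zero (the pattern used by Gibbs sampling and non-linear SVMs)."""
        start, end = A_csc.indptr[j], A_csc.indptr[j + 1]
        row_ids = A_csc.indices[start:end]               # rows where column j is non-zero
        return [A_csr.getrow(i) for i in row_ids]

    # Usage sketch:
    # A = sp.random(1000, 50, density=0.05, format="csc")
    # y = np.random.randn(1000); x = np.zeros(50); resid = y - A @ x
    # x, resid = scd_least_squares_epoch(A, y, x, resid)
    # rows_touching_col_3 = column_to_row_neighbors(A, A.tocsr(), 3)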
2.2 Architecture of NUMA Machines

We briefly describe the architecture of a modern NUMA machine. As illustrated in Figure 1(d), a NUMA machine contains multiple NUMA nodes. Each node has multiple cores and processor caches, including the L3 cache. Each node is directly connected to a region of DRAM. NUMA nodes are connected to each other by buses on the main board; on the machines we use, these buses are Intel QuickPath Interconnects (QPIs), which provide up to 25.6 GB/s of bandwidth.

[Figure 3: Summary of the NUMA machines used in this study (local2, local4, local8, ec2.1, and ec2.2): for each machine, the number of NUMA nodes, cores per node, RAM size, CPU clock, LLC size, and the local and remote memory bandwidth measured with the STREAM benchmark [6].]

3. THE DIMMWITTED ENGINE

[Figures 4 and 5: an illustration of a DimmWitted engine, and a summary of the tradeoff points (access method, model replication, data replication) chosen by existing systems such as GraphLab (GL), Hogwild! (HW), and Spark (SP).]

[This section describes the DimmWitted engine: its input (the data matrix A, the model vector x, and user-supplied update functions for the row-wise, column-wise, and column-to-row access methods); how an execution plan fixes one choice along each of the three axes introduced above, namely access method, model replication, and data replication; and a simple cost model, driven by the statistical-versus-hardware-efficiency tradeoff, that DimmWitted uses for access method selection. The section also reports microbenchmarks for each tradeoff on the machines of Figure 3.]

Reads go to local NUMA memory more frequently in PerNode than in PerMachine, and the PerNode approach dominates the PerCore approach, as reads from the same node go to the same NUMA memory. Thus, we do not consider PerCore replication from this point on.

Tradeoffs. Not surprisingly, we observe that FullReplication takes more time for each epoch than Sharding. However, we also observe that FullReplication uses fewer epochs than Sharding, especially to achieve low error. We illustrate these two observations by showing the result of running SVM on Reuters using PerNode in Figure 9.

Statistical Efficiency. FullReplication uses fewer epochs, especially at low-error tolerance. Figure 9(a) shows the number of epochs that each strategy takes to converge to a given loss. We see that, to come within 1% of the optimal loss, FullReplication uses 10× fewer epochs on a two-node machine. This is because each model replica sees more data than in Sharding and therefore has a better estimate. Because of this difference in the number of epochs, FullReplication is 5× faster in wall-clock time than Sharding to converge to 1% loss. However, we also observe that, in high-error regions, FullReplication uses more epochs than Sharding and takes comparable wall-clock time to reach a given loss.

Hardware Efficiency. Figure 9(b) shows the time for each epoch across different machines with different numbers of nodes. Because we are using the PerNode strategy, which is the optimal choice for this dataset, the more nodes a machine has, the slower FullReplication is for each epoch. The slow-down is roughly consistent with the number of nodes on each machine. This is not surprising, because each epoch of FullReplication processes more data than Sharding.
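To make the three model-replication granularities concrete, the following sequential simulation applies the SGD update of Figure 1(b) under PerCore, PerNode, and PerMachine sharing. The grouping of workers into simulated nodes, the epoch-end averaging of replicas, and the use of Python rather than the C++ engine are our assumptions; in particular, DimmWitted batches and delays cross-socket updates rather than averaging only at epoch boundaries.

    import numpy as np

    def replicated_sgd_epoch(A, labels, x, grad, stepsize, strategy,
                             n_nodes=2, cores_per_node=4):
        """One epoch of SGD under a model-replication strategy.

        strategy: 'PerCore'    -> one replica per (simulated) core
                  'PerNode'    -> one replica per (simulated) NUMA node
                  'PerMachine' -> a single shared replica
        Replicas are averaged at the end of the epoch.
        """
        n_cores = n_nodes * cores_per_node
        if strategy == "PerCore":
            replicas = [x.copy() for _ in range(n_cores)]
            owner = lambda core: core
        elif strategy == "PerNode":
            replicas = [x.copy() for _ in range(n_nodes)]
            owner = lambda core: core // cores_per_node
        elif strategy == "PerMachine":
            replicas = [x]                      # single copy, kept coherent by hardware
            owner = lambda core: 0
        else:
            raise ValueError(strategy)

        shards = np.array_split(np.random.permutation(len(A)), n_cores)
        for core, shard in enumerate(shards):   # the real engine runs cores in parallel
            r = replicas[owner(core)]
            for i in shard:
                r -= stepsize * grad(A[i], labels[i], r)

        return np.mean(replicas, axis=0)        # epoch-end synchronization

    # grad can be the logistic_grad function from the earlier SGD sketch.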

4. EXPERIMENTS

We validate that exploiting the tradeoff space that we described enables DimmWitted's orders-of-magnitude speedup over state-of-the-art competitor systems. We also validate that each tradeoff discussed in this paper affects the performance of DimmWitted.

4.1 Experiment Setup

We describe the details of our experimental setting.

Datasets and Statistical Models. We validate the performance and quality of DimmWitted on a diverse set of statistical models and datasets. For statistical models, we choose five models that are among the most popular models used in statistical analytics: (1) Support Vector Machine (SVM), (2) Logistic Regression (LR), (3) Least Squares Regression (LS), (4) Linear Programming (LP), and (5) Quadratic Programming (QP). For each model, we choose datasets with different characteristics, including size, sparsity, and under- or over-determination. For SVM, LR, and LS, we choose four datasets: Reuters^4, RCV1^5, Music^6, and Forest^7. Reuters and RCV1 are datasets for text classification that are sparse and underdetermined. Music and Forest are standard benchmark datasets that are dense and overdetermined. For LP and QP, we consider a social-network application and use two datasets from Amazon's customer data and Google's Google+ social networks^8. Figure 10 shows the dataset statistics.

Figure 10: Dataset Statistics. NNZ refers to the number of non-zero elements. The number of columns is equal to the number of variables in the model.

    Model      Dataset  #Row  #Col.  NNZ   Size (Sparse)  Size (Dense)
    SVM/LR/LS  RCV1     781K  47K    60M   914MB          275GB
               Reuters  8K    18K    93K   1.4MB          1.2GB
               Music    515K  91     46M   701MB          0.4GB
               Forest   581K  54     30M   490MB          0.2GB
    LP         Amazon   926K  335K   2M    28MB           >1TB
               Google   2M    2M     3M    25MB           >1TB
    QP         Amazon   1M    1M     7M    104MB          >1TB
               Google   2M    2M     10M   152MB          >1TB
    Gibbs      Paleo    69M   30M    108M  2GB            >1TB
    NN         MNIST    120M  800K   120M  2GB            >1TB

Metrics. We measure the quality and performance of DimmWitted and other competitors. To measure quality, we follow prior art and use the loss function for all functions. For end-to-end performance, we measure the wall-clock time it takes for each system to converge to a loss that is within 100%, 50%, 10%, and 1% of the optimal loss^9. When measuring the wall-clock time, we do not count the time used for data loading and result outputting for all systems. We also use other measurements to understand the details of the tradeoff space, including (1) local LLC requests, (2) remote LLC requests, and (3) local DRAM requests. We use Intel Performance Monitoring Units (PMUs) and follow the manual^10 to conduct these experiments.

Experiment Setting. We compare DimmWitted with four competitor systems: GraphLab [21], GraphChi [17], MLlib [33] over Spark [37], and Hogwild! [25]. GraphLab is a distributed graph processing system that supports a large range of statistical models. GraphChi is similar to GraphLab but with a focus on multi-core machines with secondary storage. MLlib is a package of machine learning algorithms implemented over Spark, an in-memory implementation of the MapReduce framework. Hogwild! is an in-memory lock-free framework for statistical analytics. We find that all four systems pick some points in the tradeoff space that we considered in DimmWitted. In GraphLab and GraphChi, all models are implemented using stochastic coordinate descent (column-wise access); in MLlib and Hogwild!, SVM and LR are implemented using stochastic gradient descent (row-wise access). We use implementations that are provided by the original developers whenever possible. For models without code provided by the developers, we only change the corresponding gradient function^11.

Footnotes:
^4 archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
^5 about.reuters.com/researchandstandards/corpus/
^6 archive.ics.uci.edu/ml/datasets/YearPredictionMSD
^7 archive.ics.uci.edu/ml/datasets/Covertype
^8 snap.stanford.edu/data/
^9 We obtain the optimal loss by running all systems for one hour and choosing the lowest.
^10 software.intel.com/en-us/articles/performance-monitoring-unit-guidelines
^11 For sparse models, we change the dense vector data structure in MLlib to a sparse vector, which only improves its performance.

4.2 End-to-End Comparison

DimmWitted outperforms Hogwild! by more than two orders of magnitude. On LP and QP, DimmWitted is also up to 3× faster than GraphLab and GraphChi, and two orders of magnitude faster than MLlib.

Tradeoff Choices. We dive more deeply into these numbers to substantiate our claim that there are some points in the tradeoff space that are not used by GraphLab, GraphChi, Hogwild!, and MLlib. Each tradeoff selected by our system is shown in Figure 14. For example, GraphLab and GraphChi use column-wise access for all models, while MLlib and Hogwild! use row-wise access for all models and allow only PerMachine model replication. These special points work well for some but not all models. For example, for LP and QP, GraphLab and GraphChi are only 3× slower than DimmWitted, which chooses column-wise and PerMachine. This factor of 3 is to be expected, as GraphLab also allows distributed access and so has additional overhead. However, there are other points: for SVM and LR, DimmWitted outperforms GraphLab and GraphChi because the column-wise algorithm implemented by GraphLab and GraphChi is not as efficient as row-wise on the same dataset. DimmWitted outperforms Hogwild! because DimmWitted takes advantage of model replication: Hogwild! incurs 11× more cross-node DRAM requests than DimmWitted, while DimmWitted incurs 11× more local DRAM requests than Hogwild! does.

Figure 14: Plans that DimmWitted Chooses in the Tradeoff Space for Each Dataset on Machine local2.

    Model (Dataset)                        Access Method  Model Replication  Data Replication
    SVM (Reuters), LR (RCV1), LS (Music)   Row-wise       PerNode            FullReplication
    LP (Amazon), QP (Google)               Column-wise    PerMachine         FullReplication

For SVM, LR, and LS, we find that DimmWitted outperforms MLlib, primarily due to a different point in the tradeoff space. In particular, MLlib uses batch-gradient-descent with a PerCore implementation, while DimmWitted uses stochastic gradient and PerNode. We find that, for the Forest dataset, DimmWitted takes 60× fewer epochs to converge to 1% loss than MLlib. For each epoch, DimmWitted is 4× faster. These two factors contribute to the 240× speed-up of DimmWitted over MLlib on the Forest dataset (1% loss). To compare with MLlib, which is written in Scala, we implemented a Scala version, which is 3× slower than C++; this suggests that the overhead is not just due to the language. MLlib has overhead for scheduling, so we break down the time that MLlib uses for scheduling and computation. We find that, for Forest, out of the total 2.7 seconds of execution, MLlib uses 1.8 seconds for computation and 0.9 seconds for scheduling. If we do not count the time that MLlib uses for scheduling and only count the time of computation, we find that DimmWitted is 15× faster than MLlib. We also implemented a batch-gradient-descent and PerCore implementation inside DimmWitted to remove these and C++-versus-Scala differences. The 60× difference in the number of epochs until convergence still holds, and our implementation is only 3× faster than MLlib. This implies that the main difference between DimmWitted and MLlib is the point in the tradeoff space—not low-level implementation differences.

For LP and QP, DimmWitted outperforms MLlib and Hogwild! because the row-wise access method implemented by these systems is not as efficient as column-wise access on the same data set. GraphLab does have column-wise access, so DimmWitted outperforms GraphLab and GraphChi because DimmWitted finishes each epoch up to 3× faster, primarily due to low-level issues. This supports our claims that the tradeoff space is interesting for analytic engines and that no one system has implemented all of them.

Throughput. We compare the throughput of different systems for an extremely simple task: parallel sum. Our implementation of parallel sum follows our implementation of other statistical models (with a trivial update function) and uses all cores on a single machine. Figure 13 shows the throughput of all systems on different models and one dataset. We see from Figure 13 that DimmWitted achieves the highest throughput of all the systems. For parallel sum, DimmWitted is 1.6× faster than Hogwild!, and we find that DimmWitted incurs 8× fewer LLC cache misses than Hogwild!. Compared with Hogwild!, in which all threads write to a single copy of the sum result, DimmWitted maintains one single copy of the sum result per NUMA node, so the workers on one NUMA node do not invalidate the cache on another NUMA node. When running on only a single thread, DimmWitted has the same implementation as Hogwild!. Compared with GraphLab and GraphChi, DimmWitted is 20× faster, likely due to the overhead of GraphLab and GraphChi dynamically scheduling tasks and/or maintaining the graph structure.

Figure 13: Comparison of Throughput (GB/s) of Different Systems on local2.

    System      SVM (RCV1)  LR (RCV1)  LS (RCV1)  LP (Google)  QP (Google)  Parallel Sum
    GraphLab    0.2         0.2        0.2        0.2          0.1          0.9
    GraphChi    0.3         0.3        0.2        0.2          0.2          1.0
    MLlib       0.2         0.2        0.2        0.1          0.02         0.3
    Hogwild!    1.3         1.4        1.3        0.3          0.2          13
    DimmWitted  5.1         5.2        5.2        0.7          1.3          21

4.3 Tradeoffs of DimmWitted

We validate that all the tradeoffs described in this paper have an impact on the efficiency of DimmWitted. We report on a more modern architecture, local4 with four NUMA sockets, in this section. We describe how the results change with different architectures.

4.3.1 Access Method Selection

We validate that different access methods have different performance, and that no single access method dominates the others. We run DimmWitted on all statistical models and compare two strategies, row-wise and column-wise. In each experiment, we force DimmWitted to use the corresponding access method, but report the best point for the other tradeoffs. Figure 12(a) shows the results as we measure the time it takes to achieve each loss. The more stringent loss requirements (1%) are on the left-hand side. The horizontal line segments in the graph indicate that a model may reach, say, 50% as quickly (in epochs) as it reaches 100%.
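The per-node accumulator layout used for the parallel-sum comparison in Section 4.2 can be sketched as follows. This is a structural illustration only: Python threads will not reproduce the LLC-miss and cache-invalidation behavior measured above, the node assignment of workers is simulated, and the per-node lock is our addition to keep the sketch race-free (DimmWitted's implementation is lock-free).

    import threading
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def parallel_sum_per_node(values, n_nodes=4, workers_per_node=2):
        """Each worker adds its shard into its own node's accumulator, so workers
        on one (simulated) node never write another node's copy; the per-node
        partial sums are combined once at the end."""
        n_workers = n_nodes * workers_per_node
        node_sums = np.zeros(n_nodes)
        node_locks = [threading.Lock() for _ in range(n_nodes)]
        shards = np.array_split(np.asarray(values, dtype=np.float64), n_workers)

        def work(worker_id):
            node = worker_id // workers_per_node       # which node this worker "lives" on
            partial = shards[worker_id].sum()
            with node_locks[node]:                     # lock only for Python correctness
                node_sums[node] += partial
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            list(pool.map(work, range(n_workers)))
        return float(node_sums.sum())

    # parallel_sum_per_node(np.arange(1_000_000))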
[Figure 17: (a) Tradeoffs of Data Replication; a ratio smaller than 1 means that FullReplication is faster. (b) Performance of Gibbs Sampling and Neural Networks Implemented in DimmWitted.]

Data Replication. We measure the time it takes each strategy to converge to a given loss for SVM on the same dataset, RCV1. We report the ratio of these two strategies as FullReplication/Sharding in Figure 17(a). We see that, for the low-error region (e.g., 0.1%), FullReplication is 1.8-2.5× faster than Sharding. This is because FullReplication decreases the skew of data assignment to each worker, hence each individual model replica can form a more accurate estimate. For the high-error region (e.g., 100%), we observe that FullReplication appears to be 2-5× slower than Sharding. We find that, for 100% loss, both FullReplication and Sharding converge in a single epoch, and Sharding may therefore be preferred, as it examines less data to complete that single epoch. In all of our experiments, FullReplication is never substantially worse and can be dramatically better. Thus, if there is available memory, the FullReplication data replication seems to be preferable.

5. EXTENSIONS

We briefly describe how to run Gibbs sampling (which uses a column-to-row access method) and deep neural networks (which use a row access method). Using the same tradeoffs, we achieve a significant increase in speed over the classical implementation choices of these algorithms. A more detailed description is in the full version of this paper.

5.1 Gibbs Sampling

Gibbs sampling is one of the most popular algorithms to solve statistical inference and learning over probabilistic graphical models [30]. We briefly describe Gibbs sampling over factor graphs and observe that its main step is a column-to-row access. A factor graph can be thought of as a bipartite graph of a set of variables and a set of factors. To run Gibbs sampling, the main operation is to select a single variable and calculate the conditional probability of this variable, which requires fetching all factors that contain this variable and all assignments of the variables connected to these factors. This operation corresponds to the column-to-row access method. Similar to first-order methods, a Hogwild! algorithm for Gibbs sampling was recently established [15]. As shown in Figure 17(b), applying the technique in DimmWitted to Gibbs sampling achieves 4× the sampling throughput of the PerMachine strategy.
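A minimal sketch of the column-to-row step described above. The binary variables, the linear form of the factor potentials, and the dictionary-based graph layout are illustrative assumptions of ours; they are not the factor-graph representation used in the paper.

    import numpy as np

    def gibbs_step(var, assignment, var_to_factors, factor_vars, factor_weight, rng):
        """Resample one binary variable: fetch every factor that contains it
        (the column-to-row access) plus the current values of the other variables
        in those factors, then sample from the conditional distribution."""
        energies = []
        for value in (0, 1):
            e = 0.0
            for f in var_to_factors[var]:              # factors touching `var`
                total = sum(assignment[u] for u in factor_vars[f] if u != var) + value
                e += factor_weight[f] * total          # assumed potential: weight * sum
            energies.append(e)
        p1 = 1.0 / (1.0 + np.exp(energies[0] - energies[1]))
        assignment[var] = 1 if rng.random() < p1 else 0
        return assignment

    # Usage sketch (3 variables, 2 factors):
    # factor_vars = {0: [0, 1], 1: [1, 2]}
    # var_to_factors = {0: [0], 1: [0, 1], 2: [1]}
    # factor_weight = {0: 0.5, 1: -1.0}
    # assignment = {0: 1, 1: 0, 2: 1}
    # gibbs_step(1, assignment, var_to_factors, factor_vars, factor_weight,
    #            np.random.default_rng(0))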
5.2 Deep Neural Networks

Neural networks are one of the most classic machine learning models [22]; recently, these models have been intensively revisited by adding more layers [10, 18]. A deep neural network contains multiple layers, and each layer contains a set of neurons (variables). Different neurons connect with each other only by links across consecutive layers. The value of one neuron is a function of all the other neurons in the previous layer and a set of weights. Variables in the last layer have human labels as training data; the goal of deep neural network learning is to find the set of weights that maximizes the likelihood of the human labels. Back-propagation with stochastic gradient descent is the de facto method of optimizing a deep neural network.

Following LeCun et al. [19], we implement SGD over a seven-layer neural network with 0.12 billion neurons and 0.8 million parameters using a standard handwriting-recognition benchmark dataset called MNIST^13. Figure 17(b) shows the number of variables that are processed by DimmWitted per second. For this application, DimmWitted uses PerNode and FullReplication, and the classical choice made by LeCun is PerMachine and Sharding. As shown in Figure 17(b), DimmWitted achieves more than an order of magnitude higher throughput than this classical baseline (to achieve the same quality as reported in that classical paper).

^13 yann.lecun.com/exdb/mnist/

6. RELATED WORK

We review work in four main areas: statistical analytics, data mining algorithms, shared-memory multiprocessor optimization, and main-memory databases. We include more extensive related work in the full version of this paper.

Statistical Analytics. There is a trend to integrate statistical analytics into data processing systems. Database vendors have recently put out new products in this space, including Oracle, Pivotal's MADlib [13], IBM's SystemML [12], and SAP's HANA. These systems support statistical analytics in existing data management systems. A key challenge for statistical analytics is performance. A handful of data processing frameworks have been developed in the last few years to support statistical analytics, including Mahout for Hadoop, MLI for Spark [33], GraphLab [21], and MADlib for PostgreSQL or Greenplum [13]. Although these systems increase the performance of the corresponding statistical analytics tasks significantly, we observe that each of them implements one point in DimmWitted's tradeoff space. DimmWitted is not a system; our goal is to study this tradeoff space.

Data Mining Algorithms. There is a large body of data mining literature regarding how to optimize various algorithms to be more architecturally aware [26, 38, 39]. Zaki et al. [26, 39] study the performance of a range of different algorithms, including association rule mining and decision trees, on shared-memory machines by improving memory locality and data placement at the granularity of cachelines and by decreasing the cost of coherence maintenance between multiple CPU caches. Ghoting et al. [11] optimize the cache behavior of frequent pattern mining using novel cache-conscious techniques, including spatial and temporal locality, prefetching, and tiling. Jin et al. [14] discuss tradeoffs in replication and locking schemes for K-means, association rule mining, and neural nets. This work considers the hardware efficiency of the algorithm, but not statistical efficiency, which is the focus of DimmWitted. In addition, Jin et al. do not consider lock-free execution, a key aspect of this paper.

Shared-memory Multiprocessor Optimization. Performance optimization on shared-memory multiprocessor machines is a classical topic. The seminal work of Anderson and Lam [3] and Carr et al. [7] used compiler techniques to improve locality on shared-memory multiprocessor machines. DimmWitted's locality group is inspired by Anderson and Lam's discussion of computation decomposition and data decomposition. These locality groups are the centerpiece of the Legion project [5]. In recent years, there have been a variety of domain-specific languages (DSLs) to help the user extract parallelism; two examples of these DSLs include Galois [23, 24] and OptiML [35] for Delite [8]. Our goals are orthogonal: these DSLs require knowledge about the tradeoffs of the hardware, such as those provided by our study.

Main-memory Databases. The database community has recognized that multi-socket, large-memory machines have changed the data processing landscape, and there has been a flurry of recent work about how to build in-memory analytics systems [2, 4, 9, 16, 20, 27, 28, 36]. Classical tradeoffs have been revisited on modern architectures to gain significant improvement: Balkesen et al. [4], Albutiu et al. [2], Kim et al. [16], and Li et al. [20] study the tradeoffs for joins and shuffling, respectively. This work takes advantage of modern architectures, e.g., NUMA and SIMD, to increase memory bandwidth. We study a new tradeoff space for statistical analytics in which the performance of the system is affected by both hardware efficiency and statistical efficiency.

7. CONCLUSION

For statistical analytics on main-memory, NUMA-aware machines, we studied tradeoffs in access methods, model replication, and data replication. We found that using novel points in this tradeoff space can have a substantial benefit: our DimmWitted prototype engine can run at least one popular task at least 100× faster than other competitor systems. This comparison demonstrates that this tradeoff space may be interesting for current and next-generation statistical analytics systems.

Acknowledgments. We would like to thank Arun Kumar, Victor Bittorf, the Delite team, the Advanced Analytics team at Oracle, Greenplum/Pivotal, and Impala's Cloudera team for sharing their experiences in building analytics systems. We gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA) XDATA Program under No. FA8750-12-2-0335 and the DEFT Program under No. FA8750-13-2-0039, the National Science Foundation (NSF) CAREER Award under No. IIS-1353606, the Office of Naval Research (ONR) under awards No. N000141210041 and No. N000141310129, the Sloan Research Fellowship, American Family Insurance, Google, and Toshiba. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, NSF, ONR, or the US government.

8. REFERENCES

[1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. ArXiv e-prints, 2011.
[2] M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. PVLDB, pages 1064-1075, 2012.
[3] J. M. Anderson and M. S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. In PLDI, pages 112-125, 1993.
[4] C. Balkesen et al. Multi-core, main-memory joins: Sort vs. hash revisited. PVLDB, pages 85-96, 2013.
[5] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken. Legion: Expressing locality and independence with logical regions. In SC, page 66, 2012.
[6] L. Bergstrom. Measuring NUMA effects with the STREAM benchmark. ArXiv e-prints, 2011.
[7] S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality. In ASPLOS, 1994.
[8] H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun. A domain-specific approach to heterogeneous parallelism. In PPoPP, pages 35-46, 2011.
[9] C. Chasseur and J. M. Patel. Design and evaluation of storage organizations for read-optimized main memory databases. PVLDB, pages 1474-1485, 2013.
[10] J. Dean et al. Large scale distributed deep networks. In NIPS, pages 1232-1240, 2012.
[11] A. Ghoting et al. Cache-conscious frequent pattern mining on modern and emerging processors. VLDBJ, 2007.
[12] A. Ghoting et al. SystemML: Declarative machine learning on MapReduce. In ICDE, pages 231-242, 2011.
[13] J. M. Hellerstein et al. The MADlib analytics library: Or MAD skills, the SQL. PVLDB, pages 1700-1711, 2012.
[14] R. Jin, G. Yang, and G. Agrawal. Shared memory parallelization of data mining algorithms: Techniques, programming interface, and performance. TKDE, 2005.
[15] M. J. Johnson, J. Saunderson, and A. S. Willsky. Analyzing Hogwild parallel Gaussian Gibbs sampling. In NIPS, 2013.
[16] C. Kim et al. Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs. PVLDB, 2009.
[17] A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In OSDI, pages 31-46, 2012.
[18] Q. V. Le et al. Building high-level features using large scale unsupervised learning. In ICML, pages 8595-8598, 2012.
[19] Y. LeCun et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278-2324, 1998.
[20] Y. Li et al. NUMA-aware algorithms: The case of data shuffling. In CIDR, 2013.
[21] Y. Low et al. Distributed GraphLab: A framework for machine learning in the cloud. PVLDB, pages 716-727, 2012.
[22] T. M. Mitchell. Machine Learning. McGraw-Hill, USA, 1997.
[23] D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In SOSP, 2013.
[24] D. Nguyen, A. Lenharth, and K. Pingali. Deterministic Galois: On-demand, portable and parameterless. In ASPLOS, 2014.
[25] F. Niu et al. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693-701, 2011.
[26] S. Parthasarathy, M. J. Zaki, M. Ogihara, and W. Li. Parallel data mining for association rules on shared memory systems. Knowl. Inf. Syst., pages 1-29, 2001.
[27] L. Qiao et al. Main-memory scan sharing for multi-core CPUs. PVLDB, pages 610-621, 2008.
[28] V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, pages 1080-1091, 2013.
[29] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. ArXiv e-prints, 2012.
[30] C. P. Robert and G. Casella. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer, USA, 2005.
[31] A. Smola and S. Narayanamurthy. An architecture for parallel topic models. PVLDB, pages 703-710, 2010.
[32] S. Sonnenburg et al. The SHOGUN machine learning toolbox. J. Mach. Learn. Res., pages 1799-1802, 2010.
[33] E. Sparks et al. MLI: An API for distributed machine learning. In ICDM, pages 1187-1192, 2013.
[34] S. Sridhar et al. An approximate, efficient LP solver for LP rounding. In NIPS, pages 2895-2903, 2013.
[35] A. K. Sujeeth et al. OptiML: An implicitly parallel domain-specific language for machine learning. In ICML, pages 609-616, 2011.
[36] S. Tu et al. Speedy transactions in multicore in-memory databases. In SOSP, pages 18-32, 2013.
[37] M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.
[38] M. Zaki et al. Parallel classification for data mining on shared-memory multiprocessors. In ICDE, pages 198-205, 1999.
[39] M. J. Zaki et al. New algorithms for fast discovery of association rules. In KDD, pages 283-286, 1997.
