US 8,209,664 B2

(12) United States Patent    Yu et al.    (10) Patent No.: US 8,209,664 B2    (45) Date of Patent: Jun. 26, 2012

(54) HIGH LEVEL PROGRAMMING EXTENSIONS FOR DISTRIBUTED DATA PARALLEL PROCESSING

(75) Inventors: Yuan Yu, Cupertino, CA (US); Ulfar Erlingsson, Reykjavik (IS); Michael A. Isard, San Francisco, CA (US); Frank McSherry, San Francisco, CA (US)

(73) Assignee: Microsoft Corporation, Redmond, WA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 704 days.

(21) Appl. No.: 12/406,826

(22) Filed: Mar. 18, 2009

(65) Prior Publication Data: US 2010/0241827 A1, Sep. 23, 2010

(51) Int. Cl.: G06F 9/44 (2006.01)

(52) U.S. Cl.: 717/119

(58) Field of Classification Search: 717/106, 109, 114, 119; see application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
7,802,180 B2*    9/2010   Warner et al. ............ 715/234
2005/0177585 A1  8/2005   Venkatesh et al. ......... 707/100
2006/0116578 A1* 6/2006   Grunwald et al. .......... 600/440
2007/0011155 A1  1/2007   Sarkar
2007/0038659 A1  2/2007   Datar et al.
2007/0043531 A1  2/2007   Kosche et al. ............ 702/182
2008/0005547 A1  1/2008   Papakipos et al.
2008/0034357 A1  2/2008   Gschwind
2008/0079724 A1  4/2008   Isard et al.
2008/0082644 A1  4/2008   Isard et al.
2008/0086442 A1  4/2008   Dasdan et al.
2008/0098375 A1  4/2008   Isard
2008/0120314 A1  5/2008   Yang et al.
2008/0127200 A1  5/2008   Richards et al.
2008/0250227 A1  10/2008  Linderman et al.
2009/0043803 A1* 2/2009   Frishberg et al. ......... 707/102
2009/0300656 A1* 12/2009  Bosworth et al. .......... 719/320

OTHER PUBLICATIONS
Blelloch, Guy E., et al., "Implementation of a Portable Nested Data-Parallel Language," School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, Feb. 1993, Abstract, pp. 1-28.
Isard, Michael, et al., "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks," Microsoft Research, Silicon Valley, http://research.microsoft.com/research/dv/dryad/eurosys07.pdf, Mar. 23, 2007, 14 pages.
Ferreira, Renato, et al., "Data parallel language and compiler support for data intensive applications," Elsevier Science B.V., Parallel Computing, vol. 28, Issue 5, May 2002, 3 pages.
Carpenter, Bryan, et al., "HPJava: Data Parallel Extensions to Java," NPAC at Syracuse University, Syracuse, NY, Feb. 7, 1998, pp. 1-5.
(Continued)

Primary Examiner — Anna Deng
(74) Attorney, Agent, or Firm — Vierra Magen Marcus & DeNiro LLP

(57) ABSTRACT
General-purpose distributed data-parallel computing using high-level computing languages is described. Data parallel portions of a sequential program that is written by a developer in a high-level language are automatically translated into a distributed execution plan. A set of extensions to a sequential high-level computing language are provided to support distributed parallel computations and to facilitate generation and optimization of distributed execution plans. The extensions are fully integrated with the programming language, thereby enabling developers to write sequential language programs using known constructs while providing the ability to invoke the extensions to enable better generation and optimization of the execution plan for a distributed computing environment.

17 Claims, 18 Drawing Sheets


OTHER PUBLICATIONS (continued)
Catanzaro, Bryan, et al., "A MapReduce Framework for Programming Graphics Processors," University of California, Berkeley, CA, http://web.mit.edu/rabbah/www.conferences/08/stmcspapers/catanzaro-stmcs08.pdf, 2008, 6 pages.
Duffy, Joe, et al., "Parallel LINQ: Running Queries on Multi-Core Processors," Microsoft Corporation, http://msdn.microsoft.com/en-us/magazine/cc163329.aspx, 2008, 8 pages.
Dean, Jeffrey, et al., "MapReduce: Simplified Data Processing on Large Clusters," Google, Inc., http://labs.google.com/papers/mapreduce-osdi04.pdf, OSDI 2004, pp. 1-13.
Yu, Yuan, et al., "DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language," 8th Usenix Symposium on Operating Systems Design and Implementation, Dec. 8, 2008, pp. 1-14.
Olston, Christopher, et al., "Automatic Optimization of Parallel Dataflow Programs," Yahoo! Research, http://www.usenix.org/event/usenix08/tech/full_papers/olston/olston_html/index.html, Apr. 29, 2008, 11 pages.
Yu, Yuan, et al., "Some sample programs written in DryadLINQ," Microsoft Corporation, ftp://ftp.research.microsoft.com/pub/tr/TR-2008-74.pdf, May 11, 2008, pp. 1-37.
Isard, Michael, "Dryad and DryadLINQ," Presentation to Hadoop Summit, Microsoft Corporation, Mar. 25, 2008, 15 pages.
Isard, Michael, et al., "Simple programming for simple distributed systems," Cambridge SRG, Jun. 9, 2005.
U.S. Appl. No. 12/368,231, filed Feb. 9, 2009.
* cited by examiner

Figure 1 (block diagram of one embodiment of a distributed execution engine, arranged as a tree-structured network)

Figure 2 (block diagram of computing device 100: basic configuration 106 with processing unit and memory, removable storage 108, non-removable storage 110, communication connection(s) 112, input device(s) 114, output device(s) 116)
Figure 3 (example directed graph: input partitions 152, 154, 156, 158, 160, 162 feeding vertices 170, 172, 174, 176, 178, 180)

Figure 4 (logical view of the system depicted in Figure 1)
Figure 8 (process 400 for compiling expressions into a distributed execution plan): generate execution plan graph (402); perform static optimizations (404); generate code and static data for vertices (406); generate serialized objects and serialization code for required data types (408); generate code for dynamic optimizations (410)

Figure 5 (process performed by Job Manager 14 when executing code on the distributed execution engine): create and build job graph based on the EPG (242); receive list of nodes from Name Server (244); determine which nodes are available (246); populate available nodes into Node Queue (248); populate vertices into Vertex Queue (250); identify vertices that are ready (252); send instructions to available nodes to execute vertices that are ready (254); send code to available nodes for vertices that are ready, if not cached on the same machine or locally (256); manage Node Queue (258) and manage Vertex Queue (260)

Figure 6 (software and data structures used for data-parallel processing): developer application program 310 with expressions 312, distributed execution provider, and local table objects 324 on the client machine; serialized data objects and serialization code 316 and execution plan graph 318 provided to the distributed compute system 304; compute system output tables 322

Figure 7 (rotated drawing sheet; labels include Job Manager)

Figure 10 (automatic generation of an execution plan including specification of user-defined data partitioning): receive expression having input, intermediate result, or dataset (452); determine whether the expression includes a user-supplied partitioning operator; if not, add a default dynamic data partitioning operator to the graph (458); if so, add the user-supplied data partitioning operator to the graph; continue compiling the graph

470  public static IQueryable<TSource> HashPartition<TSource, TKey>(IQueryable<TSource> source, Expression<Func<TSource, TKey>> keySelector, IEqualityComparer<TKey> comparer, int count)    Figure 11A

472  public static IQueryable<TSource> HashPartition<TSource, TKey>(IQueryable<TSource> source, Expression<Func<TSource, TKey>> keySelector, IEqualityComparer<TKey> comparer)    Figure 11B

474  public static IQueryable<TSource> RangePartition<TSource, TKey>(IQueryable<TSource> source, Expression<Func<TSource, TKey>> keySelector, TKey[] rangeKeys, IComparer<TKey> comparer)    Figure 12A

476  public static IQueryable<TSource> RangePartition<TSource, TKey>(IQueryable<TSource> source, Expression<Func<TSource, TKey>> keySelector, IComparer<TKey> comparer)    Figure 12B
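By way of illustration only (this usage sketch is not part of the patent figures; the logItems dataset, its date field, and the range keys are hypothetical, and the operators are assumed to be exposed as LINQ extension methods), the range-partitioning operator of Figure 12A might be invoked as:

DateTime[] ranges = { new DateTime(2008, 1, 1), new DateTime(2009, 1, 1) };        // hypothetical range keys
IQueryable<LogItem> byDate = logItems.RangePartition(x => x.date, ranges, null);   // null selects the default comparer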

480  public static IMultiQueryable<R1, R2> Fork<T, R1, R2>(IQueryable<T> source, Expression<Func<IEnumerable<T>, IEnumerable<ForkTuple<R1, R2>>>> mapper)    Figure 13A

482  public static IMultiQueryable<R1, R2> Fork<T, R1, R2>(IQueryable<T> source, Expression<Func<T, ForkTuple<R1, R2>>> mapper)    Figure 13B

484  public static IKeyedMultiQueryable<T, K> Fork<T, K>(IQueryable<T> source, Expression<Func<T, K>> keySelector, K[] keys)    Figure 13C

498  var q = source.Fork(p => new ForkTuple<int, int>(p.x, p.y));
     var q1 = q.First;
     var q2 = q.Second;    Figure 15
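The following sketch of the keyed Fork variant of Figure 13C is illustrative only; the population dataset, its country field, and the key values are hypothetical, and retrieval of the individual keyed datastreams is not shown in the extracted figures:

string[] keys = { "US", "UK" };                          // one output datastream per key
var byCountry = population.Fork(p => p.country, keys);   // split records according to the country field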

Figure 14 (operation of vertex code that divides an input dataset into multiple datastreams): access source dataset (490); access user-defined mapper function, or keySelector function and keys (492); map records of type T in the source dataset into different datastreams of tuples of type R1, R2, etc. (494); return object with embedded datastreams (496)

Figure 17 (automatic generation of an execution plan and vertex code for expressions invoking an Apply operator): receive user-defined function with Apply operator (508); if the function is homomorphic, generate function vertex code for all nodes having a dataset partition (512); otherwise, generate vertex code to stream all dataset partitions to a single node and generate function vertex code for that single node applying the function

502  public static IQueryable<T2> Apply<T1, T2>(IQueryable<T1> source, Expression<Func<IEnumerable<T1>, IEnumerable<T2>>> procFunc)    Figure 16A

504  public static IQueryable<T3> Apply<T1, T2, T3>(IQueryable<T1> source1, IQueryable<T2> source2, Expression<Func<IEnumerable<T1>, IEnumerable<T2>, IEnumerable<T3>>> procFunc)    Figure 16B

506  public static IEnumerable<double> SlidingAverage(IEnumerable<int> source) {
         int[] window = new int[10];
         int i = 0;
         foreach (int x in source) {
             window[i] = x;
             i = (i + 1) % 10;
             yield return window.Sum() / 10.0;
         }
     }    Figure 16C
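For context, a usage sketch (not part of the original figures) of how a user-defined function such as SlidingAverage from Figure 16C might be passed to the Apply operator of Figure 16A; the readings dataset is hypothetical and Apply is assumed to be exposed as a LINQ extension method:

IQueryable<double> smoothed = readings.Apply(xs => SlidingAverage(xs));   // readings is an IQueryable<int> obtained elsewhere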

520  [Homomorphic]
     public static IEnumerable<int> AddEach(IEnumerable<int> left, IEnumerable<int> right) {
         IEnumerator<int> leftEnu = left.GetEnumerator();
         IEnumerator<int> rightEnu = right.GetEnumerator();
         while (true) {
             bool moreLeft = leftEnu.MoveNext();
             bool moreRight = rightEnu.MoveNext();
             if (moreLeft != moreRight) {
                 throw new Exception("Streams with different lengths");
             }
             if (!moreLeft) yield break;   // both streams are finished
             int l = leftEnu.Current;
             int r = rightEnu.Current;
             int q = l + r;
             yield return q;
         }
     }

Figure 18A

522  public static IQueryable<int> AddStreams(IQueryable<int> left, IQueryable<int> right) {
         return left.Apply(right, (x, y) => AddEach(x, y));
     }    Figure 18B

Figures 19A and 19B (exemplary execution plan graphs for expressions invoking an Apply operator for a user-defined function)

550  public static IQueryable<T> ToLazyTable<T>(IQueryable<T> source, string filename)    Figure 20A

552  public static DryadTable[] Materialize(params IQueryable[] tables)    Figure 20B

554  var q = source.Fork(p => new ForkTuple<int, int>(p.x, p.y));
     var q1 = q.First.ToLazyTable("result1");
     var q2 = q.Second.ToLazyTable("result2");
     DryadTable[] results = Materialize(q1, q2);    Figure 20C

double Add(double x, double y) { return x + y; }
IQueryable<double> numbers = table.Select(x => Double.Parse(x.line));
double total = numbers.Aggregate((x, y) => Add(x, y));    Figure 21A

Figure 21B (execution plan graph for the pseudocode of Figure 21A)

[Associative]
double Add(double x, double y) { return x + y; }
IQueryable<double> numbers = table.Select(x => Double.Parse(x.line));
double total = numbers.Aggregate((x, y) => Add(x, y));    Figure 22A

Figure 22B (execution plan graph for the pseudocode of Figure 22A)

630  source.Aggregate(0, (s, x) => CubeSum(s, x));
     [Associative(Combiner = "AddInt")]
     public static int CubeSum(int s, int x) { return s + x * x * x; }    Figure 23
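The combiner named in the annotation of Figure 23 is not shown in the extracted figures; assuming it simply adds two partial sums produced on different partitions, a minimal sketch would be:

public static int AddInt(int x, int y) { return x + y; }   // combines partial CubeSum results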

640  public class Person {
         public string name;
         [Field(CannotBeNull = true)]
         public string nickname;
     }    Figure 24

650  public class Pair2 {
         public Pair pair;
         [FieldMapping("p.x", "pair.x")]
         [FieldMapping("p.y", "pair.y")]
         public Pair2(Pair p) {
             this.pair.x = p.x;
             this.pair.y = p.y;
         }
     }    Figure 25
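The Pair class referenced by Figure 25 is not shown in the extracted figures; a minimal definition consistent with the field mapping is sketched below, assuming Pair is a value type so that the Pair2 constructor can assign its fields directly:

public struct Pair {
    public int x;   // field types are assumed; the figure does not show them
    public int y;
}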

700  public static void AssumeHashPartition<TSource, TKey>(IQueryable<TSource> source, Expression<Func<TSource, TKey>> keySelector, IEqualityComparer<TKey> comparer)    Figure 26

702  population.AssumeHashPartition(x => x.birthday, null);
     var q = from people in population
             group people by people.birthday into g
             select g;    Figure 27

704  public static void AssumeRangePartition<TSource, TKey>(IQueryable<TSource> source, Expression<Func<TSource, TKey>> keySelector, TKey[] rangeKeys, IComparer<TKey> comparer)    Figure 28

706  public static void AssumeOrderBy<TSource, TKey>(IQueryable<TSource> source, Expression<Func<TSource, TKey>> keySelector, IComparer<TKey> comparer)    Figure 29
708

public void AssumeDistinct(IQueryable CTZ source, EqualityComparerCT> comparer) Figure 30 US 8,209,664 B2 1. 2 HIGH LEVEL PROGRAMMING plan is provided to the distributed execution engine for con EXTENSIONS FOR DISTRIBUTED DATA trolling parallel execution of the expression. PARALLEL PROCESSING This Summary is provided to introduce a selection of con cepts in a simplified form that are further described below in BACKGROUND the detailed description. This summary is not intended to identify key features or essential features of the claimed sub One of the most challenging problems in the field of com ject matter, nor is it intended to be used as an aid in determin puting today is how to allow a wide variety of software ing the scope of the claimed Subject matter. developers to compute effectively on large amounts of data. Parallel processing is one technique that has been 10 BRIEF DESCRIPTION OF THE DRAWINGS employed for increasing the efficiency of computing on large FIG. 1 is a block diagram of one embodiment of a distrib amounts of data. Traditionally, parallel processing refers to uted execution engine. the concept of speeding-up the execution of a program by FIG. 2 is a block diagram of a computing machine that can dividing the program into multiple fragments that can execute 15 be used to implement one embodiment of the nodes depicted concurrently, each on its own processor. A program being in FIG. 1. executed across n processors might execute n times faster FIG. 3 is an example of a directed graph. than it would using a single processor. The terms concurrently FIG. 4 is a logical view of the system depicted in FIG. 1. and parallel are used to refer to the situation where the period FIG.5 depicts a flowchart describing one embodiment of a for executing two or more processes overlap in time, even if performed by a job manager when executing code on they start and stop at different times. It is possible to perform the distributed execution engine of FIG. 1. parallel processing by connecting multiple computers in a FIG. 6 depicts one embodiment of software and data struc network and distributing portions of the program to different tures used for data-parallel processing. computers on the network. FIG. 7 depicts one embodiment of a process for executing Many Software application developers are not experienced 25 data-parallel portions of the application program in a distrib with parallel processing. Therefore, it can be difficult for uted compute system. them to write an application that can take advantage of par FIG. 8 is one embodiment of a process for compiling allel processing. Moreover, it is often difficult to divide an expressions from a user application into a distributed execu application program in Such a way that separate processors tion plan. can execute different portions of a program without interfer 30 FIG. 9 depicts one embodiment of static and dynamic ing with each other. There has been a great deal of research optimizations of a graph. performed with respect to automatically discovering and FIG. 10 is a flowchart describing the automatic generation exploiting parallelism in programs which were written to be of an execution plan including the specification of user-de sequential. The results of that prior research, however, have fined data partitioning. not been successful enough for most developers to efficiently 35 FIGS. 
11A-11B are examples of hash-partitioning opera take advantage of parallel processing in a cost effective man tor signatures in accordance with one embodiment. . FIGS. 12A-12B are examples of range-partitioning opera tor signatures in accordance with one embodiment. SUMMARY FIGS. 13 A-13C are examples of Fork operator signatures 40 for dividing an input dataset into multiple datastreams in The described technology pertains to general-purpose dis accordance with one embodiment. tributed data-parallel computing using high-level computing FIG. 14 is a flowchart describing operation of vertex code languages. Data parallel portions of a sequential program that in accordance with one embodiment that divides an input is written by a developer in a high-level language are auto dataset into multiple datastreams. matically translated into a distributed execution plan. A set of 45 FIG. 15 is pseudocode depicting one example of invoking extensions to the high-level computing language are provided an operator to divide an input dataset into multiple datas that Support distributed parallel computation and facilitate treamS. better generation and optimization of the distributed execu FIGS. 16A-16C are examples of Apply operator signatures tion plan. By fully integrating the extensions into the pro for applying user-defined functions to datasets in a distributed gramming language, developers can write sequential lan 50 data parallel processing system in accordance with one guage programs using known constructs, but with the ability embodiment. to invoke the extension to enable better generation and opti FIG. 17 is a flowchart describing the automatic generation mization of the execution plan for distributed computing. The of an execution plan and vertex code for expressions invoking extensions can include a set of unique operators that extend an Apply operator for a user-defined function. the sequential programming language to express distributed 55 FIGS. 18A-18E3 depict pseudocode for one example of data-parallel computations. The extensions can further invoking an Apply operator for a user-defined function. include a set of annotations that permit the attachment of FIG. 19A is an exemplary execution plan graph for an attributes to classes, methods and fields. Additionally, the expression invoking an Apply operator for a user-defined extensions can include a set of methods that allow a developer function that is not annotated as homomorphic. to assert various properties for datasets. 60 FIG. 19B is an exemplary execution plan graph for an In one embodiment, an expression from a sequential appli expression invoking an Apply operator for a user-defined cation program that is executing on a first machine is function that is not annotated as homomorphic. accessed. The expression invokes at least one extension for FIG. 20A is an example of an operator signature for defer distributed parallel processing by a distributed execution ring execution of a function by a distributed execution engine. engine. An execution plan for parallel processing of the 65 FIG.20B is an example of an operator signature for caus expression by the distributed execution engine using the at ing the execution of functions Subject to deferred execution least one extension is automatically created. The execution using the operator of FIG. 20A. 
US 8,209,664 B2 3 4 FIG.20C depicts pseudocode for one example of deferring provided in one embodiment that allows the developer to execution of multiple expressions and later invoking the attach attributes to classes, methods and fields. These user expressions for execution as a single job. defined annotations enable the execution provider to develop FIG. 21A depicts pseudocode for an example of an aggre better distributed execution plans. These annotations are use gate operation on a user-defined function that is not annotated ful in situations where the provider cannot determine infor as being associative. mation based on the semantics and properties of functions FIG. 21B is an example of an execution plan graph for the invoked by the high-level language operators. A set of meth pseudocode of FIG. 21A. ods is provided in one embodiment that facilitate the assertion FIG. 22A depicts pseudocode for an aggregate operation of properties of a dataset by a developer. Like the annotations, on a user-defined function that is annotated as being associa 10 these properties can be used by the execution provider to tive. better develop and optimize execution plans for the distrib FIG.22B is an example of an execution plan graph for the uted computing system. pseudocode of FIG. 22A. In some embodiments, the distributed execution plan FIG. 23 depicts pseudocode for an aggregate operation on a includes an execution plan graph (“EPG') and code for the user-defined function that is annotated as being associative 15 vertices of the EPG (“vertex code'). The compiler may also with a combiner function. serialize data objects that are referenced in the application FIG. 24 depicts pseudocode for one example of a field anno program and needed for execution of the vertex code in the tation. compute cluster. The serialized data objects may be consid FIG. 25 depicts pseudocode for one example of a fieldmap ered to be part of the distributed execution plan. In some ping annotation. embodiments, the compiler generates additional code, Such FIG. 26 is an example of a method signature for asserting as code that is used to facilitate optimizing execution in the a hashpartition property on a dataset. compute cluster. FIG. 27 depicts pseudocode for one example of asserting a In some embodiments, the overall system can be consid hashpartition property on a dataset. ered to be broken into three distinct pieces: 1) an application FIG. 28 is an example of a method signature for asserting 25 layer, 2) an execution engine, and 3) storage. The application a rangepartition property on a dataset. layer includes both the application that the developer wrote FIG. 29 is an example of a method signature for asserting and the compiler that automatically generates the distributed an order by property on a dataset. execution plan. The execution engine receives the execution FIG. 30 is an example of a method signature for asserting plan and manages parallel execution in the compute cluster. a distinct property on a dataset. 30 The storage layer may include a database manager system (DBMS) for receiving queries. This separation may allow the DETAILED DESCRIPTION application layer to interoperate with a variety of different types of execution engines, as well as a variety of different The disclosed technology pertains to general-purpose dis types of storage layers. tributed data-parallel processing using high-level languages. 
35 In some embodiments, the distributed execution provider Data-parallel portions of an application program are auto provides the automatically generated distributed execution matically translated into a distributed execution plan for pro plan (e.g., EPG, Vertex code, serialized data objects and seri cessing by a distributed computing system that exploits the alization code) to an execution engine for execution in the parallelism for more efficient computations. A developer can compute cluster. Thus, the execution engine may be a sepa create a sequential program in a high level language ("appli 40 rate program from the distributed execution provider that cation program'). The application program may be consid generated the distributed execution plan. FIG. 1 is a block ered a hybrid program with code executing on a client diagram of an architecture for a suitable execution engine that machine and data-parallel portions suitable for execution in is implemented as a tree-structure network 10 having various parallel at a distributed compute cluster. A distributed execu sub-networks within the tree-structure connected via tion provider can automatically translate the data-parallel 45 Switches. The execution engine can be used to cause the portions into the distributed execution plan. The distributed data-parallel portions of the application program to execute in execution plan is then executed on nodes in a compute cluster. the compute cluster. However, note that the data-parallel por A set of extensions to a high-level sequential programming tions of the application program can be executed by a differ language used to create user application programs is provided ent execution engine than the example described herein. In in one embodiment to improve Support for and optimization 50 other words, the code that is automatically generated (e.g., of distributed-parallel computations. The extensions can be Vertex code) can be executed by a different execution engine. fully integrated into the programming language. In this man Sub-network 12 includes Job Manager 14 and Name ner, developers can write sequential language programs using Server 16. Sub-network 12 also includes a set of switches 20, known constructs, but with access to extensions that allow 22, . . . , 24. Each switch connects sub-network 12 with a execution plans to be better optimized for execution in a 55 different sub-network. For example, switch 20 is connected to distributed computing environment. The extensions have the sub-network 30 and Switch 24 is connected to sub-network appearance of normal sequential constructs such that a devel 40. Sub-network 30 includes a set of switches 32, 34,..., 36. oper can invoke them in the same manner as standard opera Sub-network 40 includes a set of switches 42, 44, . . . , 46. tors in the native language of the application program. Switch 32 is connected to sub-network 50. Switch 42 is A set of operators that extend a sequential programming 60 connected to sub-network 60. Sub-network 50 includes a set language to express distributed data-parallel computations of computing machines 52, 54. . . . , 56. Sub-network 60 are provided in one embodiment. The operators fully inte includes a set of computing machines 62, 64. . . . , 66. Com grate with the sequential programming language to appear to puting machines 52, 54. . . . , 56 and 62, 64. . . . , 66 (as well developers as sequential constructs. 
The developer can as other computing machines at the bottom levels of the invoke these operators to cause the distributed execution pro 65 hierarchy of the tree-structured network) make up the cluster vider to generate more efficient execution plans and to better of machines that form the distributed execution engine. optimize the plans once constructed. A set of annotations is Although FIG. 1 shows three levels of hierarchy, more or US 8,209,664 B2 5 6 fewer than three levels can be used. In another embodiment the processor readable storage devices described above to the network may not be tree-structured, for example it could program one or more of the processors to perform the func be arranged as a hypercube. tions described herein. In alternative embodiments, some or The automatically generated vertex code is executed as a all of the software can be replaced by dedicated hardware parallel processing job (hereinafter referred to as a 'job') that including custom integrated circuits, gate arrays, FPGAs, is coordinated by Job Manager 14, which is a process running PLDs, and special purpose computers. on a dedicated computing machine or on one of the comput In some embodiments, a distributed execution provider ing machines in the compute cluster. Job Manager 14 is analyzes portions of the user application and automatically responsible for instantiating a job’s dataflow graph, Schedul generates a file that describes a directed graph (also referred ing processes on nodes in the compute cluster to cause the 10 to herein as an EPG) and code for vertices of the directed Vertex code to execute, providing fault-tolerance by re-ex graph. As an example, the file that describes the directed ecuting failed or slow processes, monitoring the job and col graph could be an XML file. Job Manager 14 will build a job lecting statistics, and transforming the job dataflow graph (or graph based on the file that describes the directed graph and simply job graph”) dynamically based on callbacks in order manage the distribution of the vertex code to the various to optimize execution. Name Server 16 is used to report the 15 compute nodes of the distributed compute cluster. FIG. 3 names (or other identification information such as IP provides one example of a block diagram of a directed graph Addresses) and position in the network of all of the comput that represents a system that reads query logs gathered by an ing machines in the cluster. There is a simple daemon (or Internet search service, extracts the query strings, and builds service) running on each computing machine in the cluster a histogram of query frequencies sorted by frequency. In the which is responsible for creating processes on behalf of Job example of FIG.3, the directed graph is acyclic; however, the Manager 14. directed graph could be cyclic. FIG. 2 depicts an exemplary computing device 100 for In some embodiments, a jobs external input and output implementing the various computing machines of the cluster files are represented as vertices in the graph even though they (e.g., machines 52, 54. . . . , 56 and 62, 64. . . . , 66), Job do not execute any program. Typically, for a large job, a single Manager 14 and/or Name Server 16. In its most basic con 25 logical “input' is split into multiple partitions which are dis figuration, computing device 100 typically includes at least tributed across nodes in the system as separate files. Each of one processing unit 102 and memory 104. 
Depending on the these partitions can be represented as a distinct input vertex. exact configuration and type of computing device, memory In some embodiments, there is a graph constructor which 104 may be volatile (such as RAM), non-volatile (such as takes the name of a distributed file and returns a graph made ROM, flash memory, etc.) or some combination of the two. 30 from a sequence of its partitions. The application will inter Processing unit 102 may be a single core, dual core or other rogate its input graph to read the number of partitions at form of multiple core processing unit. This most basic con runtime in order to generate the appropriate replicated graph. figuration is illustrated in FIG. 2 by line 106. For example, FIG. 3 shows six partitions or files 152, 154, Additionally, device 100 may also have additional features/ 156,158,160 and 162 of the log created by the Internet search functionality. For example, device 100 may also include addi 35 service. tional storage (removable and/or non-removable) including, The first level of the hierarchy of the graph of FIG. 3 but not limited to, magnetic disk, optical disks or tape. Such includes vertex code (Co) for implementing vertices 170, additional storage is illustrated in FIG. 2 by removable stor 172, 174, 176, 178 and 180. Herein, the code for a particular age 108 and non-removable storage 110. Computer storage vertex may also be referred to as a “vertex program” or simply media includes Volatile and nonvolatile, removable and non 40 a “vertex. As already stated, a distributed execution provider removable media implemented in any method or technology of some embodiments automatically generates this vertex for storage of information Such as computer readable instruc code from expressions in the application program running on tions, data structures, program modules or other data. the client machine. The vertex code (Co) reads its part of the Memory 104, removable storage 108 and non-removable log files, parses the data to extract the query strings, sorts the storage 110 are all examples of computer (or processor) read 45 query string based on a hash of the query string, and accumu able storage media. Such media includes, but is not limited to, lates the total counts for each query string. Although eight RAM, ROM, EEPROM, flash memory or other memory tech vertices are shown (170, 172 . . . 180), more or fewer than nology, CD-ROM, digital versatile disks (DVD) or other opti eight vertices can be used. In one embodiment, there will be cal storage, magnetic cassettes, magnetic tape, magnetic disk one vertex at this level for each partition of the log. Each of the storage or other magnetic storage devices, or any other 50 Vertices will output a set of hashes representing the query medium which can be used to store the desired information strings and a total count for each hash. This information will and which can accessed by device 100. Any such computer then be sent to an appropriate aggregator (Ag) vertex, depend storage media may be part of device 100. ing on the hash. Device 100 may also contain communications FIG.3 shows three vertices 190, 192 and 194 implement connection(s) 112 that allow the device to communicate with 55 ing the aggregator (Ag). The potential set of queries will be other devices via a wired or wireless network. 
Examples of broken up into three buckets, with one subset of hashes being communications connections include network cards for LAN sent to aggregator 190, a second Subset of hashes being sent to connections, wireless networking cards, modems, etc. aggregator 192, and a third Subset of hashes being sent to Device 100 may also have input device(s) 114 such as aggregator 194. In some implementations, there will be more keyboard, mouse, pen, Voice input device, touch input device, 60 or fewer than three aggregators. Each of the vertices 170-180 etc. Output device(s) 116 Such as a display/monitor, speakers, will be in communication with all of the aggregators to send printer, etc. may also be included. All these devices (input, data to the appropriate aggregator based on the hash. The output, communication and storage) are in communication aggregators 190,192 and 194 will aggregate all of the counts with the processor. for each query based on data received from vertices 170-180. The technology described herein can be implemented 65 Each of the aggregators 190, 192 and 194 will report its data using hardware, Software, or a combination of both hardware to (Su) vertex 196, which will combine the sums for all of and software. The software used is stored on one or more of these various queries and store those sums in results file 198.

US 8,209,664 B2 10 When all of a vertex programs input channels become In step 254, Job Manager 14 sends instructions to the ready, a new execution record is created for the vertex pro process daemons of the available nodes to execute the vertices gram in the “Ready’ state and gets placed in Vertex Queue that are ready to be executed. Job Manager 14 pairs the 208. A disk based channel is considered to be ready when the vertices that are ready with nodes that are available, and sends entire file is present. A channel which is a TCP pipe or shared instructions to the appropriate nodes to execute the appropri memory FIFO is ready when the predecessor vertex has at ate vertex. In step 256, Job Manager 14 sends the code for the least one execution record in the “Running state. vertex to the node that will be running the vertex code, if that Each of the vertex's channels may specify a “hard con code is not already cached on the same machine or on another straint’ or a “preference' listing the set of computing machine that is local (e.g., in same Sub-network). In most 10 cases, the first time a vertex is executed on a node, its binary machines on which it would like to run. The constraints are will be sent to that node. After executing the binary, that attached to the execution record when it is added to Vertex binary will be cached. Thus, future executions of that same Queue 208 and they allow the application writer to require code need not be transmitted again. Additionally, if another that a vertex be collocated with a large input file, and in machine on the same Sub-network has the code cached, then general that the Job Manager 14 preferentially run computa 15 the node tasked to run the code could get the program code for tions close to their data. the vertex directly from the other machine on the same sub When a Ready execution record is paired with an available network rather than from Job Manager 14. After the instruc computer it transitions to the Running state (which may trig tions and code are provided to the available nodes to execute ger vertices connected to its parent via pipes or FIFOs to the first set of vertexes, Job Manager 14 manages Node Queue create new Ready records). While an execution is in the 210 in step 258 and concurrently manages Vertex Queue 208 Running state, Job Manager 14 receives periodic status in step 260. updates from the vertex. On successful completion, the Managing node queue 258 includes communicating with execution record enters the “Completed state. If the vertex the various process daemons to determine when there are execution fails, the record enters the “Failed' state, which process daemons available for execution. Node Queue 210 may cause failure to propagate to other vertices executing in 25 includes a list (identification and location) of process dae the system. A vertex that has failed will be restarted according mons that are available for execution. Based on location and to a fault tolerance policy. If every vertex simultaneously has availability, Job Manager 14 will select one or more nodes to at least one Completed execution record, then the job is execute the next set of vertices. Steps 252-256 may be deemed to have completed successfully. If any vertex is rein repeated until all vertices have been run. carnated more than a set number of times, the entire job has 30 Further details of execution engines can be found in U.S. failed. 
Published Patent Application 2008/0082644, entitled “Dis Files representing temporary channels are stored in direc tributed Parallel Computing.” U.S. Published Patent Appli tories managed by the process daemon and are cleaned up cation 2008/0098375, entitled “Runtime Optimization of after job completion. Similarly, vertices are killed by the Distributed Execution Graph:” and U.S. Published Patent process daemon if their parent job manager crashes. 35 Application 2008/0079724, entitled “Description Language FIG.5 depicts a flowchart describing one embodiment of a for Structured Graphs: all of which are all hereby incorpo process performed by Job Manager 14 when executing vertex rated by reference for all purposes. code 202 on the distributed execution engine of FIG.1. In step FIG. 6 depicts one embodiment of software and data struc 242, Job Manager 14 creates the job graph 206 based on the tures used for data-parallel processing. In general, FIG. 6 EPG and vertex code 208. In one embodiment, the EPG is a 40 depicts a client machine 302 and a distributed compute sys description of an execution plan, Such as a description written tem 304. The client machine 302 may be implemented with in XML. Thus, Job Manager 14 may create the job graph 206 computer device 100 depicted in FIG. 2. The client 302 is from an XML description. running an application program 310 and a distributed execu In step 244, Job Manager 14 receives a list of nodes from tion provider 314 that extracts expressions 312 (or expression Name Server 16. Name Server 16 provides Job Manager 14 45 trees) from the application program 310. Based on the expres with the name (or identification) of each node within the sions 312, the distributed execution provider 314 generates an network as well as the position of each node within the tree execution plan graph (“EPG') 318, vertex code 202, and structured network. In many embodiments, a node is a com serialized data objects and serialization code 316, which are puting machine. In some embodiments, a computing machine each provided to the distributed compute system 304. In this may have more than one node. 50 example, Job Manager 14 distributes the vertex code 202 and In step 246, Job Manager 14 determines which of the nodes serialized objects and serialization code 316 to compute are available. A node is available if it is ready to accept nodes 52-66. In one embodiment, the client machine 302 can another program (associated with a vertex) to execute. In one form part of the distributed compute system 304. implementation, Job Manager 14 queries each process dae Note that the application program 310 may be a sequential monto see whether it is available to execute a program. In one 55 program that has code that executes on the client 302 in implementation, Job Manager 14 assumes that all machines addition to the data-parallel portions that execute in the dis listed by the NS are available. If Job Manager 14 cannot tributed compute system 304. For example, the data-parallel connect to a PD (or if a PD fails to often), then Job Manager code might perform a page-rank of web pages, whereas the 14 marks the PD as unusable. Job Manager 14 may dispatch code that executes on the client 302 might present the page several copies of each vertex to a set of process daemons 60 rank Statistics to a user in a graphical user interface. Thus, the chosen according to a scheduling algorithm. 
In step 248, Job application program 310 may be thought of as a “hybrid Manager 14 populates all of the available nodes into Node program. Note that in some conventional systems two sepa Queue 210. In step 250, Job Manager 14 places all the vertices rate programs would need to be written to accomplish what that need to be executed into Vertex Queue 208. In step 252, application program 310 performs. For example, a first pro Job Manager 14 determines which of the vertices in Vertex 65 gram might be written in a language Such as SQL to perform Queue 208 are ready to execute. In one embodiment, a vertex database queries and second program might be written in a is ready to execute if all of its inputs are available. language Such as C to perform functions at the client device. US 8,209,664 B2 11 12 Moreover, in some embodiments, the developer does not need Further, serialization code may be automatically generated to be concerned over which variables are local to the client for the data types needed to execute at the remote computer 302 and which are remote because the distributed execution nodes. provider 314 takes care of this. As previously discussed, in some implementations, the The application program 310 may have both declarative expressions 312 are based on the Expression class of a .NET and imperative operations. The application program 310 may . In one aspect, the distributed execution provider 314 include traditional structuring constructs such as functions, manipulates and transforms the expression 312 and breaks it modules, and libraries, and express iteration using standard into pieces. In one aspect, each piece is used to generate C# loops. In some embodiments, the distributed execution plan code, which is the vertex code 202. Note that data structures employs a fully functional, declarative description of the 10 represented by the expressions 312 may be similar to syntax data-parallel components, which enables Sophisticated trees that are used by compilers to represent the code during rewritings and optimizations such as those traditionally the compilation process. employed by parallel databases. In step 358, the distributed execution provider 314 invokes In one implementation, the application program 310 is a Job Manager 14. In one embodiment, the Job Manager 14 written in the LINQ (Language INtegrated Queries) program 15 executes behind a firewall. In step 360, Job Manager 14 ming language. A LINQ program is a sequential program creates a job graph 206 using the distributed execution plan composed of LINQ expressions. A LINQ program is a 318 that was generated in step 354. Job Manager 14 schedules Microsoft .NET Framework component that adds native data and spawns the vertices as resources become available in the querying capabilities to .NET languages. The .NET frame distributed compute system 304. In step 362, each of the work is a software framework that is available with several vertices executes the code 202 that was generated in step 354. Windows(R) operating systems that are available from The compute nodes have access to input tables 333 to make Microsoft corporation of Redmond, Wash. A LINQ program computations. The input tables333 are data being processed can be debugged using standard .NET development tools. by the user application 310. Some of the input tables 333 The application program 310 is not limited to LINQ nor is it datasets can be based on the result of a previous computation limited to the .NET Framework. 
25 performed by the distributed compute system 304 for the user FIG.7 depicts one embodiment of a process 350 for execut application 310. The datasets in the input tables333 can also ing data-parallel portions of the application program 310 in a be based on some other external computation. Note that the distributed compute system 304. FIG. 6 will be referred to input tables333 may be composed of partitions that reside on when discussed the process 350. In step 352, a user applica different machines and that each partition can have replicas tion 310 executes on the client machine 302. In one embodi 30 on different machines. In step 364, the job completes and the ment, the user application 310 is written by the developer in a results are output to the distributed compute system output high level language. In one embodiment, the application pro tables 322. gram 310 creates one or more expressions 312 during runt In step 366, Job Manager 14 terminates, returning control ime. However, the actual execution of the expression 312 may back to the distributed execution provider 314. In step 368, be deferred until later in the process 350. 35 the distributed execution provider 314 creates local table In one implementation, the expression 312 is based on objects 324 encapsulating the output of execution in the dis classes provided by a .NET library. In one aspect, the expres tributed compute system 304. These local objects 324 may sion 312 is base on .NET “Expression” classes. A .NET then be used as inputs to Subsequent expressions 312 in the Expression class is in the namespace System.Linq.Expres user application program 310. In one implementation, local Sion. There are numerous Subclasses, such as BinaryExpres 40 table objects 324 are fetched to the local context only if Sion, ConstantExpression, UnaryExpression, LambdaEx explicitly de-referenced. pression, MemberAssignment, etc. For example, an In step 370, control returns to the user application program expression 312 may be implemented as a tree of expression 310. The user application 310 has access to the local table classes with each node in the tree being an operator. Child objects 324. In one implementation, an iterator interface nodes may show inputs to operators. As a specific example, 45 allows the user application 310 to read the local table objects the addition of two constants may be represented as a tree 324 as .NET objects. However, there is no requirement of with a root of “BinaryExpression' and two leaf nodes con using .NET objects. taining the constant expressions. Thus, as previously dis In step 372, the application program 310 may generate cussed an expression 312 might also be referred to as an subsequent expressions 312, which may be executed by expression tree. 50 repeating steps 352-370. In step 354, the user application 310 initiates data parallel FIG. 8 is one embodiment of a process 400 for compiling execution, which may result the expression 312 being passed an expression 312 from a user application 310 into a distrib to the distributed execution provider 314. In one aspect, the uted execution plan. Process 400 is one implementation of user application 310 makes a call in order to initiate data step 356 of process 350. In step 402, an execution plan graph parallel execution. However, it is not required that the user 55 (EPG) is generated from an expression 312. Step 402 occurs application 310 make call to initiate data parallel execution. 
when the distributed execution provider 314 receives control In one aspect, data parallel execution is initiated in response after the application program 310 initiates parallel process to the user application 310 attempting to enumerate a value ing. The distributed execution provider 314 converts the raw for an expression 312. When the user application 310 expression 312 into an execution plan graph (EPG) 318, attempts to enumerate a value for the expression 312, data 60 where each vertex is an operator and edges represent its inputs parallel execution is initiated to compute the value. and outputs. The EPG 318 may be related to a conventional In step 356, the distributed execution provider 314 com database query plan; however, the EPG may encompass com piles the expression 312 into a distributed execution plan318. putations that are not easily formulated as “queries.” The EPG Step 356 may include the decomposition of the expression may be a directed graph, which may or may not be acyclic. 312 into Sub-expressions. Each sub-expression corresponds 65 The existence of common Sub-expressions and operators like to a vertex. Step 356 may also include the automatic genera “Fork” means that EPGs cannot always be described by trees. tion of the vertex code, as well as static data for the vertices. In some implementations, the EPG is a “skeleton” that is US 8,209,664 B2 13 14 written in a language such as XML. For example, the EPG tioned, the user application 310 can be thought of as a hybrid may be a skeleton of the job data-flow graph 206 that will be program that has code for executing at the client 302 and code executed by the execution engine. that is executed in parallel in the distributed compute system In step 404, static optimizations of the EPG 318 are per 304. It may be that the user application 310 refers to a local formed. In one implementation, the distributed execution pro data object that is needed by the vertex code 202. The serial vider 314 applies term-rewriting optimizations on the EPG ization code may be bundled with the vertex code 202 and 318. In one embodiment, each EPG node is replicated at run shipped to compute nodes. The serialization code allows the time to generate a 'stage, which may be defined as a collec compute nodes to read and write objects having the required tion of Vertices running the same computation on different data types. The serialized objects are provided to the vertices partitions of a dataset. In one implementation, the optimizer 10 because the vertex code 202 references those objects. Note annotates the EPG 318 with metadata properties. For edges of that the developer is not required to declare which data is local the EPG 318, these annotations may include the data type and and which data is remote. The serialization code 316 allows the compression scheme, if any, used after serialization. In data to be passed in the channels between the vertices. This one implementation, the data types are .NET data types. For serialization code 316 can be much more efficient than stan nodes of the EPG 318, the annotations may include details of 15 dard .NET serialization methods since it can rely on the the partitioning scheme used, and ordering information contract between the reader and writer of a channel to access within each partition. The output of a node, for example, the same statically known datatype. 
might be a dataset that is hash-partitioned by a particular key, In step 410, the distributed execution provider 314 gener and sorted according to that key within each partition. This ates code for performing dynamic optimizations. Generating information can be used by subsequent OrderBy nodes to code for dynamic optimization is discussed below. choose an appropriate distributed sort algorithm. In one Invarious embodiments, the distributed execution provider aspect, the properties are seeded from the LINQ expression 314 performs both static and dynamic optimizations. The tree and the input and output tables metadata, and propagated static optimizations may be greedy heuristics or cost-based and updated during EPG rewriting. optimizations. The dynamic optimizations are applied during Propagating these properties may be more difficult than for 25 job execution and may consist in rewriting the job graph a conventional database. The difficulties stem from the much depending on run-time data Statistics. In various implemen richer data model and expression language used to create the tations, the optimizations are sound in that a failure to com application program 310. Consider one of the simplest opera pute properties simply results in an inefficient, though cor tions: input.Select(X=> f(X)). If f is a simple expression, e.g. rect, execution plan. X.name, then it is straightforward for the distributed execution 30 In one embodiment, the static optimizations are condi provider 314 to determine which properties can be propa tional graph rewriting rules triggered by a predicate on EPG gated. However, for arbitrary fit is very difficult to determine node properties. Static optimizations may be focused on whether this transformation preserves the partitioning prop minimizing disk and network I/O. Some important optimiza erties of the input. tions include the following. However, many other types of The distributed execution provider 314 can usually infer 35 optimizations can be performed. properties in the application programs 310 typical users write. Pipelining: Multiple operators may be executed in a single Partition and sort key properties are stored as expressions, and process. The pipelined processes may themselves be expres it is often feasible to compare these for equality using a sions 312 and can be executed by an existing single-computer combination of static typing, static analysis, and reflection. In LINQ implementation. one embodiment, a simple mechanism is provided that allows 40 Removing redundancy: The distributed execution provider users to assert properties of an expression 312 when it is 314 removes unnecessary hash- or range-partitioning steps. difficult or impossible to determine the properties automati Eager Aggregation: Since re-partitioning datasets is expen cally. Further details of static optimizations are discussed sive, down-stream aggregations are moved in front of parti below. tioning operators where possible. In step 406, the vertex code 202 and static data for the 45 I/O reduction: Where possible, the distributed execution vertices are generated. While the EPG 318 encodes all the provider 314 takes advantage of TCP-pipe and in-memory required information, it is not necessarily a runnable pro FIFO channels instead of persisting temporary data to files. In gram. In one embodiment, dynamic code generation auto one embodiment, data is by default compressed before per matically synthesizes LINQ code to be run at the vertices. 
The forming a partitioning in order to reduce network traffic. generated code may be compiled into a .NET assembly that is 50 Users are allowed to manually override compression settings shipped to cluster computers at execution time. The Sub to balance CPU usage with network load if the optimizer expression in a vertex may be built from pieces of the overall makes a poor decision. EPG318. In some implementations, the EPG318 is created in In one embodiment, API hooks are used to dynamically the original client computer's execution context, and may mutate the job graph 356 as information from the running job depend on this context in two ways: (1) The expression 312 55 becomes available. For example, the distributed execution may reference variables in the local context. These references provider 314 provides “callback code' to Job Manager 14. are eliminated by partial evaluation of the Sub-expression at This callback code is added to the job graph 206. During code-generation time. For primitive values, the references in runtime, this callback code causes information to be gathered the expressions 312 may be replaced with the actual values. and used to dynamically mutate the job graph 206. The call Object values are serialized to a resource file which is shipped 60 back code may also perform the dynamic optimizations based to computers in the cluster at execution time. (2) The expres on the gathered information. sion 312 may reference .NET libraries. In this case, .NET In one implementation, the mutation is based on aggrega reflection may be used to find the transitive closure of all tion. Aggregation gives a major opportunity for I/O reduction non-system libraries referenced by the executable, which are since it can be optimized into a tree according to locality. Data shipped to the cluster computers at execution time. 65 may be aggregated first at the computer level, next at the rack In step 408, serialized objects and serialization code 316 level, and finally at the cluster level. The topology of such an are generated for required data types. As previously men aggregation tree can only be computed at run time, since it is US 8,209,664 B2 15 16 dependent on the dynamic scheduling decisions which allo and storing tables partitioned across these databases may be cate vertices to computers. The distributed execution provider much more efficient than using flat disk files for some appli 314 may use techniques discussed in U.S. Published Patent cations. Application programs 310 can use “partitioned SQL Application 2008/0098375, entitled “Runtime Optimization tables as input and output in some embodiments. The distrib of Distributed Execution Graph, which has already been 5 uted execution provider 314 of some embodiments identifies incorporated herein by reference in its entirety. and ships some Subexpressions to the SQL databases for more In one embodiment, dynamic data partitioning is used. efficient execution. Dynamic data partitioning sets the number of vertices in each Further, a single-computer LINQ-to-Objects implementa stage (i.e., the number of partitions of each dataset) at run time tion allows applications programs 310 to be run on a single based on the size of its input data. Conventional databases 10 usually estimate dataset sizes statically, but these estimates computer for testing on Small inputs under the control of a can be very inaccurate. As one example, the estimates may be debugger before executing on a full cluster dataset. 
In one inaccurate in the presence of correlated queries. In one implementation, the debugger is a part of the Visual Studio(R) embodiment, dynamic hash and range partitions are Sup development system, which is available from Microsoft Cor ported. For range partitions both the number of partitions and 15 poration of Redmond, Wash. Debugging a distributed appli the partitioning key ranges are determined at run time by cation is a notoriously difficult problem. Distributed applica sampling the input dataset. tion jobs may belong running, processing massive datasets on The following example for Sorting a dataset d illustrates large clusters, which could make the debugging process even many of the static and dynamic optimizations available. Dif more challenging. One embodiment is a mechanism to run ferent strategies are adopted depending on d’s initial parti applications on a single computer with very Sophisticated tioning and ordering. FIG. 9 shows the evolution of an support from the .NET development environment. Once an OrderBy node O 422 in a complex case, where d is not already application is running on the cluster, an individual vertex may range-partitioned by the correct sort key, nor are its partitions fail due to unusual input data that manifests problems not individually ordered by the key. The transformation of apparent from a single-computer test. One aspect uses a deter OrderBy node O 422 to graph 424 is static. The transforma 25 ministic-replay execution model, which allows re-execution tion of graph 424 to graph 426 and then graph 426 to graph of such a vertex in isolation with the inputs that caused the 428 are both dynamic, based on information learned at runt failure, and the system includes scripts to ship the vertex 1C. executable, along with the problematic partitions, to a local Referring now to graph 424, first the dataset is re-parti computer for analysis and debugging. tioned. The DS stage performs deterministic sampling of the 30 input dataset. The samples are aggregated by a histogram Performance debugging is a much more challenging prob vertex H, which determines the partition keys as a function of lem. In some embodiments, programs report Summary infor data distribution (load-balancing the computation in the next mation about their overall progress. However, if particular stage). The D Vertices perform the actual repartitioning, based stages of the computation run more slowly than expected, or on the key ranges computed by H. Next, a merge node M 35 their running time shows Surprisingly high variance, it may be interleaves the inputs, and a Snode sorts them. M and S are necessary to investigate a collection of disparate logs to diag pipelined in a single process, and communicate using itera nose the issue manually. The centralized nature of the job tOrS. manager makes it straightforward to collect profiling infor The number of partitions in the DS+H+D stages of graph mation to ease this task. 426 is chosen at run time based on the number of partitions in 40 A set of extensions to the high-level sequential program the preceding computation. The number of partitions in the ming language used to create user application programs is M+S stages of graph 428 is chosen based on the volume of provided in one embodiment to improve support for distrib data to be sorted. uted-parallel computations in the distributed execution As previously discussed, some embodiments use the LINQ engine 10. The extensions can be fully integrated into the framework. 
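As a rough illustration of the single-computer testing workflow described above, the following hedged C# sketch runs the same query logic over a small in-memory sample with ordinary LINQ-to-Objects before it would be submitted to the cluster; the method names and the idea of re-pointing the pipeline at a cluster-backed table are assumptions, not an API defined by this disclosure.

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: the pipeline is written once against IQueryable and exercised
// over a tiny in-memory sample first.
public static class LocalDebugSketch
{
    public static IEnumerable<int> Pipeline(IQueryable<string> lines) =>
        lines.Select(l => l.Length).Where(n => n > 10);

    public static void Main()
    {
        // Small input, single computer, full debugger support.
        var sample = new[] { "short", "a considerably longer line" }.AsQueryable();
        foreach (var n in Pipeline(sample))
            Console.WriteLine(n);

        // Later, the same Pipeline method could be handed an IQueryable produced by
        // the distributed execution provider to run over a full cluster dataset.
    }
}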
One of the benefits of using the LINQ framework 45 programming language. In this manner, developers can write is that other systems that use the same or similar constructs sequential language programs using known constructs, but can be leveraged. For example, PLINQ, which allows code to also have access to extensions that allow execution plans to be be run within each vertex in parallel on a multi-core server, better optimized for execution in a distributed computing can be leveraged. PLINQ is described in, A Query Language environment. The extensions have the appearance of a normal for Data Parallel Programming.” J. Duffy, Proceedings of the 50 sequential construct Such that a developer can invoke them in 2007 Workshop on Declarative Aspects of Multicore Pro the same manner as standard operators in the native language gramming, 2007, which is hereby incorporated by reference of the application program. for all purposes. PLINQ attempts to make the process of The set of extensions includes a set of operators for extend parallelizing a LINQ program as transparent as possible. ing a sequential programming language to express distributed PLINQ employs the iterator model since it is better suited to 55 data-parallel computations in one embodiment. A developer fine-grain concurrency in a shared-memory multi-processor can invoke these operators to cause the distributed execution system. Because both PLINQ and certain embodiments of the provider to generate more efficient execution plans and to disclosure use expressions composed from the same LINQ better optimize the plans once constructed. The set of exten constructs, their functionality may be combined. In some sions can also include a set of annotations that allows a embodiments, Vertices execute LINQ expressions, and in 60 developer to attach attributes to classes, methods and fields general the addition by the code generator of Some embodi that will enable the execution provider to develop better dis ments of a single line to the vertex's program triggers the use tributed execution plans. These annotations are useful in situ of PLINQ, allowing the vertex to exploit all the cores in a ations where the provider cannot determine information cluster computer. based on the semantics and properties of functions invoked by In some implementations, interoperation with a LINQ-to 65 the high-level language operators. The set of extension can SQL system allows vertices to directly access data stored in further include a set of methods that facilitates the assertion of SQL databases. Running a database on each cluster computer properties of a dataset by a developer. Like the annotations, US 8,209,664 B2 17 18 these properties can be used by the execution provider to the user specifies the type of partitioning, the execution pro better develop and optimize execution plans for the distrib vider will determine the optimal number of partitions at runt uted computing system. 1C. Data Partitioning Operators Query 1 illustrates an exemplary query that may be written One or more data partitioning operators are provided in one in the sequential application program language to invoke one embodiment that provide a developer various degrees of con of the hash-partitioning operators. This particular query hash trol over the partitioning of input files or datasets within the partitions the dataset logitems into 100 partitions based on the distributed, data-parallel computing system. As earlier key query, which itself is a field of logitems. 
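Because FIGS. 11A and 11B themselves are not reproduced in this text, the following C# sketch shows one plausible shape for the two hash-partitioning operator signatures just described; the parameter order and defaults are assumptions reconstructed from the description and from Query 1 below, and the bodies are placeholders.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Sketch only: the execution provider would rewrite these calls into partitioning
// vertices rather than execute the placeholder bodies.
public static class HashPartitionOperators
{
    // FIG. 11A style: the caller fixes the number of partitions with count.
    public static IQueryable<TSource> HashPartition<TSource, TKey>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, TKey>> keySelector,
        int count,
        IEqualityComparer<TKey> comparer = null)   // null => EqualityComparer<TKey>.Default
        => throw new NotImplementedException();

    // FIG. 11B style: the number of partitions is chosen dynamically at run time.
    public static IQueryable<TSource> HashPartition<TSource, TKey>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, TKey>> keySelector,
        IEqualityComparer<TKey> comparer = null)
        => throw new NotImplementedException();
}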
Because the described, the distributed execution provider 314 may specify comparer argument is left null, the hash code of a key k is partitioning of input files or datasets from the serialized data 10 determined using a default function objects 316 as part of the execution plan graph. This parti EqualityComparer-TKey>.Default. tioning is performed automatically by the execution provider logItems. HashPartition(x => x.query, 100, null) Query 1 to develop a distributed execution plan. These individual par FIG. 12A is an exemplary signature 474 for a range parti titions of a dataset are distributed to the appropriate nodes for 15 tioning operator RangePartition in accor use when executing vertex code 202. dance with one embodiment. The input dataset is range-par In one example, the data partitioning operators are invoked titioned into a user-supplied number of data Subsets using a by a developer to override the default partitioning applied by key specified by a key selection function. The key k is speci the execution provider and to enforce a partitioning on an fied by the key selection function keySelector in FIG. 12A. output dataset. From the perspective of the sequential appli The argument rangeKeys specifies the list of keys used to cation program, the partitioning operators are generally no divide the partitions. If comparer is null, Comparer-Tkey>.Default is used. The number of partitions ops (no-operation) since they reorganize a collection without is equal to rangeKeyS.Length--1. changing its contents. FIG. 10 is a flowchart describing a FIG. 12B describes a range partitioning signature 476 process performed by the compiler to process user-supplied where the number of data partitions is determined dynami partitioning operators in one embodiment. The process can be 25 cally at runtime, rather than being user-defined. The operator used at step 402 of FIG. 8 in one example to build portions of range-partitions the dataset Source based on the key k speci execution plan graph 318. At step 452, the execution provider fied by the key selection function keySelector. If comparer is 314 receives an expression object 312 from the developer null, Comparer-TKey>.Default is used. The number of par application program that operates on an input, intermediate titions and the partition keys are determined dynamically at result or dataset that is Suitable for partitioning in the compute 30 runtime. The execution provider determines the number of cluster. At step 454, the provider determines whether the partitions and calculates the keys to determine the ranges for expression object includes a user-supplied data partitioning the partitioned data to optimize the jobs completion in the operator. If the expression includes an operator, the compiler compute cluster. adds the operator to the EPG at a vertex at step 458. The vertex Query 2 illustrates an exemplary query that may be written will accept the input dataset at an input edge of the graph and 35 in the sequential application program language to invoke one output the partitioned Subsets as one or more output edges. If of the range-partitioning operators. This particular query the expression object does not have a user-supplied partition range-partitions the dataset population into nine partitions ing operator, the compiler adds a default dynamic partitioning based on the key age, which is a field of the population dataset. operator to the EPG for the dataset at step 456. 
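Since FIGS. 12A and 12B are likewise not reproduced here, a plausible C# shape for the two range-partitioning operators just described is sketched below; the signatures are assumptions reconstructed from the description (eight boundary keys yield rangeKeys.Length + 1 = 9 partitions), and the bodies are placeholders.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Sketch only.
public static class RangePartitionOperators
{
    // FIG. 12A style: the caller supplies the partition boundary keys explicitly.
    public static IQueryable<TSource> RangePartition<TSource, TKey>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, TKey>> keySelector,
        TKey[] rangeKeys,
        IComparer<TKey> comparer = null)           // null => Comparer<TKey>.Default
        => throw new NotImplementedException();

    // FIG. 12B style: boundary keys and partition count are sampled at run time.
    public static IQueryable<TSource> RangePartition<TSource, TKey>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, TKey>> keySelector,
        IComparer<TKey> comparer = null)
        => throw new NotImplementedException();
}

// Usage corresponding to Query 2 below (eight boundary keys give nine partitions):
//   int[] keys = { 10, 20, 30, 40, 50, 60, 70, 80 };
//   var byAge = population.RangePartition(x => x.age, keys, null);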
Whereas the 40 compiler Supplied partitioning operator is dynamic to allow Intkeys—int {10, 20, 30, 40, 50, 60, 70, 80}; popu re-partitioning during dynamic optimization of the EPG at lation.RangePartition (x => x.age, keys, null) Query 2 runtime, the user-supplied operator will be used to enforce a Fork particular partitioning scheme. After adding the user-defined The data partitioning operators take a single dataset as an partitioning to the EPG or a default partitioning, the execution 45 input and output a single dataset that is partitioned into Sub provider 314 will continue compiling the graph of step 460. sets, which together form the original dataset. An additional Various types of partitioning operators are provided in operator, hereinafter referred to as Fork for convenience, is accordance with one or more embodiments. FIG. 11A is an provided to accept a single dataset as an input and provide exemplary signature 470 for a hash partitioning operator in multiple datasets as output. Because some native constructs accordance with O embodiment, denoted 50 utilize single output operators as a standard interface, the HashPartition in this particular example. Fork operator provides multiple datastreams within a single The hash-partitioning operator described in FIG. 11A takes a object. In this manner, native operators that have a single dataset and applies a hash coded key to partition the dataset output can return multiple datastreams within the single into a number of user-defined partitioned subsets. The opera object of the output. The abstract class IMultiQueryable if tor hash-partitions the dataset Source based on the key speci 55 defined with the possibility for multiple datastreams within a fied by the key selection function keySelector. The hash code single object. By applying the Fork operator to a dataset, a for a key k is computed using comparer...GetHashCode(k). If user can write the individual records of a dataset to multiple comparer is left null, EqualityComparer-Tkey). Default is datastreams within a single output object. used. The argument count specifies the number of partitions. FIG. 13A is an exemplary signature 480 of an operator FIG. 11B is an exemplary signature 472 for a hash parti 60 Fork-T, R1, R2>. The operator Fork applies a user-defined tioning operator similar to that described in FIG. 11A. Again, function mapper to an input dataset source to generate two the dataset Source is hash-partitioned on a hash-coded key k, output streams of type R1 and R2. The operator takes as input which is computed using comparer. GetHashCode(k) or a stream of records of type T from the dataset source. The EqualityComparer-Tkey>.Default if comparer is null. operator returns a single object containing two output streams Unlike the operator in FIG. 11A, however, the hash partition 65 of tuples having a type ForkTuple.. defined in FIG. 11B hash partitions the dataset into a number FIG. 13B is an exemplary signature 482 of another opera of partitions determined dynamically at runtime. Thus, while tor for returning multiple datasets in a single object from the US 8,209,664 B2 19 20 output of an operator. The operator Fork-T, R1, R2, R3> is line of code returns the second datastream as q2 by invoking similar to the operator in FIG. 13A but returns three datas the Second method from the second property of class treams. A user-defined function mapper is applied to a dataset IMultiQueryable-R1, R2>. 
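FIGS. 13A through 13C and properties 1 and 2 are described but not reproduced here, so the following C# sketch of the Fork operator, the ForkTuple type, and the IMultiQueryable interface is an assumption consistent with that description; the key-routed interface name and all bodies are placeholders.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Sketch only.
public struct ForkTuple<R1, R2>
{
    public R1 First;
    public R2 Second;
}

// A single result object carrying two logical output datastreams.
public interface IMultiQueryable<R1, R2>
{
    IQueryable<R1> First { get; }    // property 1 in the text
    IQueryable<R2> Second { get; }   // property 2 in the text
}

// Hypothetical shape for the key-routed variant of FIG. 13C: one datastream per key.
public interface IKeyedMultiQueryable<T, K>
{
    IQueryable<T> Get(K key);
}

public static class ForkOperators
{
    // FIG. 13A style: a user-defined mapper turns one input stream into two output
    // streams packaged in a single object.
    public static IMultiQueryable<R1, R2> Fork<T, R1, R2>(
        this IQueryable<T> source,
        Expression<Func<IEnumerable<T>, IEnumerable<ForkTuple<R1, R2>>>> mapper)
        => throw new NotImplementedException();

    // FIG. 13C style: records whose computed key matches an entry in keys are routed
    // to the corresponding datastream; non-matching records may be discarded.
    public static IKeyedMultiQueryable<T, K> Fork<T, K>(
        this IQueryable<T> source,
        Expression<Func<T, K>> keySelector,
        K[] keys)
        => throw new NotImplementedException();
}

// Usage in the spirit of pseudocode 498:
//   var forked = source.Fork(mapper);
//   IQueryable<string> q1 = forked.First;    // first datastream
//   IQueryable<int>    q2 = forked.Second;   // second datastream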
Source of type T to generate an object with three output Although the Fork operator is principally described in the streams of tuples having a type ForkTuplesR1, R2, R3>. context of data parallel computing, it can be used in other FIG. 13C is an exemplary signature 484 of an operator contexts as well. For example, the operator can be invoked ForkT. K that takes an input dataset and returns a single within an application written in multi-core processing lan object containing a collection of datasets based on a set of guages to cause execution across the different cores of a keys. The operator accepts as a first argument a dataset that is multi-core processor having a . An input 10 dataset can be mapped to multiple datastreams within a single to be split into multiple datastreams. In this example, the input object. Each datastream from the object can be independently dataset is source having a type T as specified by the provided to a different core within a multiple-core processor IQueryable. interface. The operator accepts a second for parallel processing. argument, in this example keySelector, which specifies a Apply function to be used for calculating a key from each record in 15 The distributed execution provider provides a range of dataset source. The keySelector function takes each record of system operations that invoke distributed execution in the type T and computes a value of type K. The third argument, compute cluster. In some instances, developers may wish to keys, is an array of keys for dividing the input dataset into Supply user-defined functions beyond the system operations different datastreams. The size or length of the array corre for execution in the distributed compute cluster. In accor sponds directly to the number of different output datastreams dance with one embodiment, an operator is provided to allow there will be. For each record in the input dataset, the calcu a developer to specify that a certain user-defined function lated key value is compared with the array of keys. If a should be applied to a dataset within the distributed comput records key value matches one of the keys in the array, the ing system. This operator, hereinafter referred to as Apply for record is mapped to the datastream corresponding to the convenience, is provided to allow user-defined functions to be matching key in the array. If a records key does not match any 25 executed across an entire dataset without regard to data par key in the array, that record can be discarded and not mapped titioning. Generally, user-defined functions written in high to any of the datastreams within the object. level sequential languages are assumed to be applied to a FIG. 14 is a flowchart describing the execution of vertex dataset on a single machine. The Apply operator allows the code at one or more compute nodes during evaluation of the user to write the function for a single dataset from this single Fork operator to embed multiple datastreams within a single 30 machine perspective, but have the function executed over a dataset object. At step 490, the source dataset is accessed. dataset that is partitioned across multiple machines. The Step 490 may be performed in parallel at multiple compute operator takes a function and passes it to an iterator over the nodes for different partitions of the input dataset. At step 492, entire input collection, thereby allowing arbitrary streaming the user-defined mapper function is accessed if the Fork computations. 
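The Apply signatures are detailed below with reference to FIGS. 16A and 16B; as a hedged preview, one plausible C# shape for the one-input and two-input forms is sketched here, together with a streaming function in the spirit of the SlidingAverage example of FIG. 16C. The exact signatures, the attribute spelling, and the ten-element window are assumptions, and the operator bodies are placeholders.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Sketch only.
public static class ApplyOperators
{
    // One-input form (FIG. 16A): procFunc sees the whole dataset as one stream,
    // regardless of how it is partitioned across machines.
    public static IQueryable<TResult> Apply<T, TResult>(
        this IQueryable<T> source,
        Expression<Func<IEnumerable<T>, IEnumerable<TResult>>> procFunc)
        => throw new NotImplementedException();

    // Two-input form (FIG. 16B): procFunc iterates two datasets side by side.
    public static IQueryable<TResult> Apply<T1, T2, TResult>(
        this IQueryable<T1> source,
        IQueryable<T2> source2,
        Expression<Func<IEnumerable<T1>, IEnumerable<T2>, IEnumerable<TResult>>> procFunc)
        => throw new NotImplementedException();
}

public static class UserFunctions
{
    // A streaming computation that keeps only a small window in memory. Marking it
    // with the homomorphic annotation discussed below would tell the provider it may
    // run on each partition independently:
    // [Homomorphic]
    public static IEnumerable<double> SlidingAverage(IEnumerable<int> source)
    {
        var window = new Queue<int>();
        foreach (var x in source)
        {
            window.Enqueue(x);
            if (window.Count > 10) window.Dequeue();
            yield return window.Average();
        }
    }
}

// Usage corresponding to Query 3 below: source.Apply(x => SlidingAverage(x))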
The Apply operator can be invoked on datasets operator utilizes a mapper function individing the dataset into 35 automatically partitioned by the compiler, datasets having a distinct streams. If a key selection function is used by the user-defined partitioning operator applied as described above operator, step 492 includes accessing the key selector func or datasets that are already partitioned, for example, by the tion and a set of keys for performing the Fork operation. At way a user constructed the file(s) storing the dataset. As will step 494, records of a first type (e.g., T) in the source dataset be described in more detail hereinafter, the Apply operator are mapped into different datastreams comprising tuples of a 40 can be invoked to stream all partitions of a dataset to a single second and third datatype (e.g., R1, R2). At step 496, the machine for application of the user-defined function in one compute node(s) return an object with the different datas embodiment. In another, the operator can be invoked to apply treams embedded therein. the user-defined function to the individual partitions of the Two properties of the abstract class IMultiQueryable. are provided for accessing the individual datastreams 45 FIG. 16A is an exemplary signature 502 for an Apply within objects resulting from invocation of the Fork operator. operator in accordance with one embodiment for applying Property 1 defines a method using the get function for return user-defined functions to datasets partitioned across multiple ing the first datastream, R1, of IMultiQueryable. as machines. The operator accepts two arguments of type T. The an IQueryablesR1> object denoted First. Property 2 defines first argument source is the input dataset. The second argu another method using the get function for returning the sec 50 ment procFunc is the user-defined function. The operator will ond datastream, R2, of IMultiQueryable-R1, R2Das an perform execution of the procFunc user-defined function IQueryable. object denoted Second. across the entire dataset source, regardless of the datasets partitioning. FIG. 16B is a signature 504 for another exem public IQueryable-R1>First get: property 1 plary Apply operator similar to that of FIG.16A. The operator 55 in FIG.16B will perform the user-defined function across two input datasets. The additional argument source2 is the second public IQueryable-R2 >Second get: property 2 input dataset. FIG. 15 is an exemplary set of pseudocode 498 illustrating FIG.16C is an exemplary set of pseudocode to illustrate the use of property 1 and property 2 to retrieve the datastreams application of a user-defined function using the Apply opera within an IMulitaueryable.. The third integers over a stream of integers. The method accepts a US 8,209,664 B2 21 22 stream of integers from source as input, and produces a stream FIG. 18A is another exemplary set of code 520 represent of doubles as output. The local variable window is used to ing a user-defined function invoked using the Apply operator, maintain the current sliding window. By invoking the Apply further illustrating application of the homomorphic annota operator, Query 3 below applies the Sliding Average method tion. The user-defined function addeach adds the correspond to each item in the source dataset without regard to the parti 5 ing numbers in two streams of the same length. The per-vertex tioning of the dataset over multiple machines. transformation is shown in FIG. 18, which is operable on source. 
Apply (x=>SlidingAverage (x)) Query 3 IEnumerable inputs. The function iterates over two streams IEnumerable.left and IEnumerable-right in parallel The Apply operator can apply user-defined functions to using the MoveNext() and Current stream operators. Other each partition of a dataset in isolation on individual machines 10 operators such as foreach could be used. To create the IQue or apply functions to an entire dataset by streaming the dataset yable version of this addition that is operable across two input to a single node. An annotation is provided in one embodi datasets, the addeach method is invoked on the two inputs ment to allow a user to designate whether a given method using the Apply operator using the pseudocode 522 illustrated should be applied to an entire dataset or whether it can be in FIG. 18B. applied to individual portions in parallel. The annotation, 15 hereinafter referred to as homomorphic, is Supplied in one FIGS. 19A and 19B depict execution plan graphs that embodiment as a simple attribute (e.g., .NET) to indicate implement the addeach function using the Apply operator opportunities for parallelism. If the homomorphic annotation across the two input datastreams. FIG. 19A depicts the execu is asserted for a given method and the Apply operator is tion plan when the homomorphic annotation is not applied to invoked on the method, the system will treat the method as the method addeach and FIG. 19B depicts an execution plan being operable on the individual dataset partitions. If how when the homomorphic annotation is applied. In FIG. 19A, ever, the homomorphic annotation is not asserted for the input In O includes a partition of the first dataset and a par method, the system will treat the method as needing to be tition of the second dataset serving as input to vertex 523. applied on the dataset in its entirety. Generally, a given Input In1 includes a partition of the first dataset and a method m is said to be homomorphic if the condition holds 25 partition of the second dataset serving as input to Vertex 524. that m(concat(x, y)) concat(m(X), m(y)) for any data parti The various partitions of each input datastream are first tions x and y. Consider the earlier example of the Sliding AV merged to recreate the entire datasets. Merge 6 represents the erage method depicted in FIG. 16C. If the user adds the merging of the first dataset and the second dataset at a single annotation homomorphic to the method definition and vertex 526 using the individual partitions spread through the invokes the method using the Apply operator, the execution 30 cluster. The completed datasets are then used as input to the provider 314 will utilize this information to generate an effi addeach function invoked by the Apply 0 operator, which cient execution plan. The plan can provide for execution of resides at a single vertex 527 in the distributed compute the method on individual data partitions (either user or system cluster. Accordingly, parallel execution of the addeach defined) in parallel rather than force the individual partitions method is not performed as a single machine will be used to to a single node to reconstruct the dataset for application of 35 facilitate streaming through the entirety of both datasets. the method all at once. In FIG. 19B, the system interprets the homomorphic anno FIG. 
17 is a flowchart describing operation of the distrib tation on the addeach method as indicating that the individual uted execution provider in one embodiment to generate an data partitions can be operated on in isolation. The system execution graph and vertex code for an expression including generates the execution plan accordingly to exploit the poten an Apply operator. At step 508, the execution provider 314 40 tial parallelism. The various partitions at different computing receives an expression from the user application having a machines are not merged. Rather, the corresponding records user-defined function which is invoked using the Apply in each stream at a single vertex are first selected and then operator. At step 510, the provider determines whether the utilized as input to the addeach function which is invoked user-defined function includes the homomorphic annotation. using the Apply operator at that vertex. Input In O is selected If the function is annotated as homomorphic, the execution 45 at a vertex 530 and input In1 is selected at another vertex provider generates an execution graph and vertex code at Step 532. As illustrated, the addition is performed in parallel at a 512 for the expression. The execution provider generates a number of Vertices, generating individual outputs. The plan for execution of the method invoked by the Apply opera addeach function is applied to the partitions in input In O at tor at all compute nodes that include a partition of the dataset vertex 534 and to the partitions in input In1 at vertex 536. operated on by the method. If the dataset is partitioned across 50 Although not shown, these outputs could be later merged to multiple nodes, the execution provider generates vertex code form a single output. for each node at which a partition resides. In this manner, the In one embodiment, the homomorphic annotation includes method will be executed in a distributed data-parallel fashion optional flags to further define the nature of the homomorphic at a plurality of nodes. apply. A “KeySelector” and “PartitionType' flag specify that If the function is not annotated as homomorphic, the execu 55 the function is homomorphic under the condition that the tion provider generates an execution plan graph and vertex input is partitioned by the KeySelector and PartitionType. If code at steps 514 and 516. Because the function is not homo the input is not partitioned according to the specified type, the morphic, the execution plan will specify that all partitions of method will not be treated as homomorphic. These flags can the dataset operated on by the function be streamed to a single be used independently or together. The compiler will auto node for application of the function. At step 514, the execu 60 matically re-partition the data to match the specified condi tion provider generates vertex code for each of the nodes tions if possible. having a partition of the dataset. The vertex code streams the The homomorphic operator includes the flag “Left” in one various partitions to a single node where they can be recon embodiment that can be used to further define the homomor structed into the original dataset. At step 516, the provider phism for a two input function. If the “Left' flag is asserted, generates vertex code for the node where the dataset partition 65 the partitions of the dataset for the first input remain undis are to be streamed. The vertex code generated at step 516 turbed at their individual vertices. 
The partitions of the dataset applies the user-defined function to the dataset in its entirety. for the second input, however, are merged and the entire US 8,209,664 B2 23 24 second input dataset is provided to each of the vertices having both datasets q. First and q. Second, thereby saving computa one of the first input dataset partitions. tion resources. Moreover, the execution provider can pipeline ToLazyTable and Materialize these expressions together as a single job to be executed It is often useful for the distributed execution engine to together in the distributed compute cluster. It is noted that the utilize so-called lazy evaluation at runtime. The engine will execution provider can pipeline other expressions that are not build up an expression tree in memory and only cause execu necessarily invoked using the ToIazyTable operator with tion when the application attempts to retrieve one of the expressions that are for execution as a single job. For output tables from the expression. When standard operators example, the execution provider may be able to semantically capable of expression in a tree structure are invoked, facili identify that certain expressions can be lazily evaluated and tating lazy evaluation is relatively straight forward. If more 10 formulate a Suitable execution plan specifying the various complicated operators are invoked. Such as the aforemen expressions as a single job. tioned Fork to divide an input dataset or Apply to execute a Associative Annotation user-defined function across an entire dataset, the execution High-level query-based languages may provide an aggre tree is not always capable of a tree expression. Such execution gate operator that aggregates the results of a function applied plans may still benefit from lazy evaluation, however. 15 to a stream of integers, as demonstrated in the example of Accordingly, a set of operators is provided in one embodi FIG. 3. An associative annotation is provided in one embodi ment to allow a user to specify that an operation only be ment that instructs the system that a method is associative, executed when instructed. In this manner, multiple queries and thus, that the aggregate function may be applied on each including standard operators and more complicated ones can partition of the dataset independently in parallel. The system be simultaneously invoked to generate multiple outputs as a can compute the final results by combining the intermediate single job. By utilizing this set of operators, the developer can results produced on the partitions. A method m is said to be direct pipelining of multiple expressions into a single job. associative if the condition holds that mGm(m(s0, a), b), c)-m A first operator, hereinafter referred to as ToLazyTable, (m(s0, a), m(m(s0, b), c)) for some seed SO and any data instructs the system that the referenced query is to be evalu records a, b and c. ated and a table created, but that such evaluation should be 25 FIG. 21A is an exemplary set of pseudocode 600 that delayed until specifically invoked. FIG. 20A sets forth an computes the sum of all values in a text file. It first parses the exemplary signature 550 for a ToIazyTable operator in accor file as a set offloating-point values, one per line. The method dance with one embodiment. The operator accepts two argu then computes the sum of the first two values from the file, ments, source and filename. 
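A minimal C# sketch of the deferred-evaluation pattern built from ToLazyTable and the Materialize operator described next is given below; since FIGS. 20A through 20C are not reproduced here, the signatures and the LazyTable handle type are assumptions, and the usage comment mirrors pseudocode 554.

using System;
using System.Linq;

// Sketch only.
public sealed class LazyTable<T>
{
    // Hypothetical handle representing a table whose creation has been deferred.
}

public static class LazyEvaluationOperators
{
    // ToLazyTable (FIG. 20A): remember that source should be written to filename,
    // but do not trigger any execution yet.
    public static LazyTable<T> ToLazyTable<T>(this IQueryable<T> source, string filename)
        => throw new NotImplementedException();

    // Materialize (FIG. 20B, described below): execute every query behind the given
    // tables, letting the provider compile them into one distributed job.
    public static void Materialize(params object[] tables)
        => throw new NotImplementedException();
}

// Usage in the spirit of pseudocode 554: both outputs of a single Fork are declared
// lazily and then materialized together, so the Fork is evaluated only once.
//   var q  = source.Fork(mapper);
//   var t1 = q.First.ToLazyTable("result1");
//   var t2 = q.Second.ToLazyTable("result2");
//   LazyEvaluationOperators.Materialize(t1, t2);   // single job, single Fork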
The first argument source indi followed by computing the sum of the next two values until cates the input dataset to which the operator is to be applied. 30 the entire file is parsed. After computing the Sums of pairs of The second argument filename specifies the output table that values from the file, the intermediate results are aggregated by is to be created using the input dataset. The second operator, taking the sum of each intermediate result to determine the Materialize, instructs the system to execute the query or que total sum of all values in the file. ries to which the ToIazyTable operator was applied. FIG. An example of an execution plan graph 318 for the distrib 20B sets forth an exemplary signature 552 for a Materialize 35 uted application of the code of FIG. 21A is illustrated in FIG. operator in accordance with one embodiment. The Material 21B. The input file is divided into two partitions InO and ize operator accepts a single argument tables, which specifies In 1. A first vertex 602 reads in the data from the first parti the tables that are to be created when the materialize operator tion using a select function. A second vertex 604 reads in the is invoked. Consider the exemplary pseudocode 554 depicted data from the second partition of the input file using a select in FIG. 20O. The first line of code takes an input dataset 40 function. The data from each partition is then provided to a Source and produces two output datasets q. First and q. Second third vertex 606 which collects the data from the two input using the Fork operator earlier described. The second line of readers. The aggregate vertex implements the addition to add code stores the first output dataset q. First in a first table the pairs of values from the entire dataset, followed by aggre “result1. The third line of code stores the second output gating the intermediate results to determine the total Sum of dataset q. Second in a second table “result2. Under a normal 45 all values in the file. process flow without the ToIazyTable operator, the second If the associative annotation is added to the Add method as line of code would be evaluated before the third line of code. depicted in the pseudocode of FIG. 22A, a more efficient Because the system does not know that something is to be execution plan can be generated by the execution provider as done with the second output dataset when it evaluates the illustrated in FIG.22B. With the knowledge that the method second line of code, it will write q. First to table “result1 as a 50 can be applied to individual partitions of the input file, the single computation by invoking the Fork operator on Source execution provider can promote parallelism in the computa to retrieve the first output dataset. The system will then move tion of the pairs of input values. A first vertex 622 reads in the to line 3, where it will write q. Second to table “result2” as a data from the first partition as in FIG. 21B. However, vertex second computation, again invoking the Fork operator on 622 also performs the Add function on pairs of values from Source to retrieve the second output dataset. Accordingly, the 55 the first partition. The first vertex 622 will aggregate the Fork operator is performed twice. intermediate results obtained from the first partition before As shown in FIG. 20O, q. First and q. Second are invoked passing the results to vertex 626. Likewise, the second vertex using the ToIazyTable operator in lines 2 and 3. 
The system 624 reads in the data from the second partition and performs is instructed to create the first and second tables from outputs the Add function on pairs of values from the second partition. q. First and q. Second but to delay doing so until instructed. 60 The second vertex 624 aggregates these intermediate results The materialize operator in line 4 instructs the system to obtained from the second partition before passing the results evaluate both expressions, triggering the actual execution to to vertex 626. Vertex 626 performs an aggregate of the Add generate q1...First and q. Second and write the results to tables function to combine the intermediate results from the first “result1 and “result2..” Such a sequence allows the system to partition and the intermediate results from the second parti generate both results tables from multiple queries as a single 65 tion. FIG.22B is a more efficient execution plan that exploits job. The system can generate both results tables using a single parallelism in the intermediate result calculations by using computation and invocation of the Fork operator to retrieve multiple compute nodes. Different machines are used to per US 8,209,664 B2 25 26 form the Add function on the first and second data partitions stateful. The user writing the method may know that the to more efficiently compute these intermediate results. The method computes an average of 10 integers and thus, will intermediate results are then provided to a single vertex to only need to maintain 10 records of the input dataset in aggregate the results and provide a final result of the total sum memory at a given time. Because the methods memory usage of all values in the file. is independent of the input dataset size, the user can declare An optional “combiner function is provided in one the method as not stateful. Pseudocode 1 below can be added embodiment to specify if the final result of the aggregate before the method definition to instruct the execution pro function is produced by a function different than the method vider that the method is not stateful and thus, can be pipelined that is declared to be associative. For example, Query 4 or otherwise optimized with other functions. applies a custom aggregate function CubeSum to a stream of 10 integers from a source dataset. Resource(IsStateful=false) Pseudocode 1 Field source. Aggregate (O, (S, x)=>CubeSum(S, x)) Query 4 In one embodiment, the distributed execution engine The pseudocode 630 for the CubeSum function is depicted assumes that all fields of classes can be null. This assumption in FIG. 23. The function is annotated by the Associative 15 can lead to less than optimal execution plans being generated attribute to indicate that the method Cubesum is associative in Some instances. Each time a dataset needs to be written to Such that it can be applied to individual partitions of its input disk or transferred across the network (e.g., to a different dataset. The “combiner function is specified to indicate that Vertex), the dataset will be serialized at the originating com the final result will be produced by the function “Add Int' and pute node and deserialized at the receiving compute node. If not the method CubeSum. a field can be null, the execution engine will generate a plan Resource accounting for the serialization and deserialization of fields In one embodiment, an annotation is provided to allow that may be null. 
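A hedged C# sketch of the associative annotation and its optional combiner follows, in the spirit of Query 4 and the CubeSum example of FIG. 23 (which is not reproduced here); the attribute spelling, its Combiner parameter, and the method bodies are assumptions.

using System;
using System.Linq;

// Sketch only.
[AttributeUsage(AttributeTargets.Method)]
public sealed class AssociativeAttribute : Attribute
{
    // Names the function used to merge per-partition intermediate results when it
    // differs from the annotated method itself.
    public string Combiner { get; set; }
}

public static class Aggregators
{
    // Declared associative, so the provider may aggregate each partition
    // independently and then merge the partial results with AddInt.
    [Associative(Combiner = nameof(AddInt))]
    public static int CubeSum(int acc, int x) => acc + x * x * x;

    public static int AddInt(int a, int b) => a + b;
}

public static class AssociativeDemo
{
    public static void Main()
    {
        var source = Enumerable.Range(1, 10).AsQueryable();

        // Query 4 from the text: a custom aggregate over a stream of integers. On a
        // cluster, the annotation allows the tree-shaped aggregation of FIG. 22B.
        int total = source.Aggregate(0, (s, x) => Aggregators.CubeSum(s, x));
        Console.WriteLine(total);   // 3025 for the values 1..10
    }
}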
developers to declare methods or user-defined functions as In order to better optimize execution plans, one embodi not stateful. A method is said to be stateful if it causes the ment provides an annotation to allow a user to declare that a consumption of memory in proportion to the size of its input. 25 field cannot be null. The execution engine will access that The execution provider considers whether expressions are annotation and if the field cannot be null, optimize the execu stateful in developing and optimizing execution plans. For tion plan accordingly. With knowledge that certain fields are example, the provider attempts to pipeline multiple operators not nullable, the engine can optimize the plan to decrease the together in a single process for execution as a single job in amount of network traffic and disk I/O. Consider the defini many instances. Methods that consume memory in propor 30 tion 640 for the exemplary Person class depicted in FIG. 24. tion to the size of their input are examined to determine In this simplified example, the class Person includes the fields whether a job will consume more memory at a particular name and nickname. The cannot be null annotation is asserted compute node than is available. Generally, the execution pro for the nickname field, but not the field name. Accordingly, vider analyzes expressions semantically to determine the execution engine can thus determine that for the name whether they are stateful. For example, the provider can deter 35 field, it must account for the possibility of null value. How mine semantically that the “select” operator is not stateful. ever, the engine will determine that for the nickname field, The execution provider understands that this operator null values are not allowed and thus, that such accounting is requires holding one row of an input dataset in memory at a not necessary. time. Thus, while the operator may invoke an entire dataset, it Field Mapping will not consume memory in proportion to the size of the 40 It is not infrequent that operators transform records from dataset. Similarly, the execution provider can identify that a one type to another using arbitrary methods (e.g., user-de “Sort operator requires that a dataset or partition be main fined). In such instances, the distributed execution provider tained in memory in its entirety. In instances where the execu may be incapable of inferring information relating to the tion provider does not have knowledge or is unable to deter transformation that may be useful in the optimizing the mine the memory usage of a given function, the system will 45 execution plan. Accordingly, one embodiment provides an assume the function to be stateful. annotation, hereinafter referred to as FieldMappling for con User-defined functions that may be invoked using the venience, to allow users to specify relatedness between fields Apply, Aggregate, Fork or other operator are examples of or expressions, such as a specification of how data flows from expressions that may not be capable of semantic analysis by one datatype to another datatype. Consider the following the execution provider to determine whether they are stateful. 50 exemplary query 5 that transforms records of type Pair to In accordance with one embodiment, an annotation is pro records of type Pair2. vided to allow users to override the default stateful assump tion for unknown methods. This annotation may be referred to from pair in pairs select new Pair2(pair); Query 5 as a “resource' annotation for convenience. 
For user-defined An exemplary definition 650 of the class Pair2 is shown in methods known not to be stateful, the user can add the 55 FIG. 25. The definition reveals an obvious correspondence “resource' annotion to the method to instruct the execution between both the x and y fields of records of type Pair and provider that the method is not stateful. The execution pro records of type Pair2. The individual records in the dataset vider will use this information to generate a better execution “pair are not altered by the query. Rather, the records are plan. With the knowledge that the method is not stateful, the merely transformed between datatypes. If we consider for execution provider can define execution of the method with 60 exemplary purposes that the dataset pairs is hash partitioned out regard to the size of the input partition or dataset for the by Pair.X., we can see that the dataset resulting from execution method. For example, the execution provider can place the of the query above will be has partitioned by Pair2.pair.X. method in a pipeline with other methods at a common vertex Because the distributed execution engine may not be able to to be executed as a single job. infer the correspondence between fields in records of type Consider the exemplary Sliding Average method earlier 65 Pair and type Pair2, the generated execution plan will call for described. As a user-defined method, the execution provider a repartitioning of the dataset resulting from execution of the is unable to determine semantically whether the method is query. US 8,209,664 B2 27 28 In order to better optimize execution plans, one embodi AssumeHashPartition method as indicated in line 1. The user ment provides an annotation whereby users can specify rela invokes the method with arguments specifying that the popu tionships between fields. Continuing with the example in lation dataset is hashpartitioned using the birthday field as a FIG. 25, the FieldMapping annotation is used to declare map key using key selection function X. Comparer is left null Such pings between certain fields of records of type Pair and that EqualityComparer-TKey> is used. records of type Pair2. Specifically, a first field mapping at line Similar to the AssumeHashParition operator, an AssumeR 4 specifies that “p.x” from class Pair2 maps to pair.X from angePartition operator is provided in one embodiment for class Pair. Similarly, a second field mapping at line 5 specifies instructing the engine that a dataset is range partitioned. FIG. that “p.y’ from class Pair2 maps to pairy from class Pair. By 28 sets forth an exemplary signature 704 for an AssumeR explicitly declaring the relationship between these fields, the 10 angePartition method in accordance with one embodiment. execution engine generates a better exaction plan. If the This method asserts that the source is range-paritioned on the dataset pairs is hash partitioned using the X field, the execu key computed by the key selection function keySelector. The tion engine can use the field mapping to avoid reparitioning argument rangeKeys specifies the list of keys to divide the the result of the query. Because the resulting dataset is already partitions. The comparison function is comparer.Compare(). hash partitioned using the same key by virtue of the relation 15 If rangeKeys is null, the actual keys and number of partitions ship between the x' fields, the execution plan avoid a re is left undeclared. If comparer is null, parititoning. Comparer-TKeys. Default will be used. 
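The three annotations discussed in this passage can be pictured together in one hedged C# sketch; the attribute spellings below follow the names used in the text (Resource with IsStateful, a cannot-be-null marker, and FieldMapping), but their exact forms and the Person, Pair, and Pair2 definitions are assumptions, since FIGS. 24 and 25 are not reproduced here.

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch only.
[AttributeUsage(AttributeTargets.Method)]
public sealed class ResourceAttribute : Attribute
{
    public bool IsStateful { get; set; } = true;   // unknown methods default to stateful
}

[AttributeUsage(AttributeTargets.Field)]
public sealed class CannotBeNullAttribute : Attribute { }

[AttributeUsage(AttributeTargets.Constructor, AllowMultiple = true)]
public sealed class FieldMappingAttribute : Attribute
{
    public FieldMappingAttribute(string source, string destination) { }
}

public class Person
{
    public string name;                      // may be null: serializer must allow for it
    [CannotBeNull] public string nickname;   // never null: cheaper serialization plan
}

public class Pair  { public int x; public int y; }

public class Pair2
{
    public Pair pair;

    // Declares how fields flow from the constructor argument into this type, so a
    // hash-partitioning on Pair.x is known to survive as Pair2.pair.x.
    [FieldMapping("p.x", "pair.x")]
    [FieldMapping("p.y", "pair.y")]
    public Pair2(Pair p) { pair = p; }
}

public static class UserMethods
{
    // Pseudocode 1 from the text: memory use does not grow with input size, so the
    // provider may pipeline this method with other operators at one vertex.
    [Resource(IsStateful = false)]
    public static double AverageOfTen(IEnumerable<int> xs) => xs.Take(10).Average();
}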
The FieldMapping annotation is extended to lambda One embodiment includes a method for asserting a dataset expression in one embodiment to allow the specification of is already sorted in a particular manner. FIG. 29 sets forth an relations between transformations according to different exemplary signature 706 for a method, hereinafter referred to lambda expressions. In this aspect, the user-specified map as AssumeCrderBy for convenience, that allows a user to ping is applied to specify a correspondence between lambda specify properties associated with an already sorted dataset. expressions. Consider a first lambda expression that operates The method asserts that each paritions of the Source dataset is on a tuple of type A to achieve a result. If the user knows that sorted on the key computed by the key selection function a different lambda expression that operates on a tuple of type 25 keySelector. The comparison function is comparer. B will achieve the same result, a mapping annotation can be Compare(). If comparer is null, Comparer-TKey). Default is added in the appropriate class to map the lambda expressions. used. This mapping is used by the execution engine to generate a A user can specify that an entire table is sorted using a more efficient plan. For example, the result of execution of the combination of the AssumeRangePartition and Assume?)r- first lambda expression can be accessed to determine the 30 derBy methods. If both methods are asserted for a source result of the second lambda expression rather than specifying dataset using the same key selection function, the source the actual evaluation of the second lambda expression. dataset of an entire table will be interpreted as being sorted. Dataset Properties A method is provided in one embodiment to enable a user The optimization of execution plans is enhanced in one to assert that all records in a source are distinct. An exemplary embodiment through the use of user-specified properties for 35 signature 708 for a method, hereinafter referred to as Assu datasets. As earlier described, the execution engine will auto meDistinct for convenience, is depicted in FIG. 30. The com matically partition datasets in some instances. In others, a parison function is comparer. Compare(). If comparer is null, developer may invoke one of the number of partitioning Comparer-TKeys. Default is used. By invoking the Assume operators earlier described that allow user-defined partition Distinct method for a dataset, the execution engine assumes ing of dataset. It is also possible that a dataset is partitioned 40 that every element in the set is distinct when passed to the before being imported into the distributed execution system. comparison function. The execution engine can generate a For example, a user may include a dataset in an application more effective execution plan by not having to account for the that is partitioned in a manner known to the user. The execu possibility of non-distinct elements in the Source. tion engine may be unable to discern that the dataset is par Although the subject matter has been described in lan titioned. Accordingly, one embodiment provides one or more 45 guage specific to structural features and/or methodological methods that allow a user to specify a partitioning property acts, it is to be understood that the subject matter defined in for a dataset. the appended claims is not necessarily limited to the specific FIG. 26 is an exemplary signature 700 for a method to features or acts described above. 
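Because FIGS. 26 through 30 are not reproduced in this text, the following C# sketch shows one plausible shape for the dataset-property assertion methods just described; the signatures are assumptions, each method is written as a no-op pass-through standing in for the provider's metadata handling, and the usage comment mirrors the birthday grouping of FIG. 27.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Sketch only: assertions the developer makes about an existing dataset.
public static class DatasetPropertyAssertions
{
    // FIG. 26: the dataset is already hash-partitioned on the given key.
    public static IQueryable<T> AssumeHashPartition<T, TKey>(
        this IQueryable<T> source,
        Expression<Func<T, TKey>> keySelector,
        IEqualityComparer<TKey> comparer = null)   // null => EqualityComparer<TKey>.Default
        => source;

    // FIG. 28: the dataset is already range-partitioned; rangeKeys may be null to
    // leave the actual boundaries and partition count undeclared.
    public static IQueryable<T> AssumeRangePartition<T, TKey>(
        this IQueryable<T> source,
        Expression<Func<T, TKey>> keySelector,
        TKey[] rangeKeys = null,
        IComparer<TKey> comparer = null)
        => source;

    // FIG. 29: each partition of the dataset is already sorted on the given key.
    public static IQueryable<T> AssumeOrderBy<T, TKey>(
        this IQueryable<T> source,
        Expression<Func<T, TKey>> keySelector,
        IComparer<TKey> comparer = null)
        => source;

    // FIG. 30: every record in the dataset is distinct.
    public static IQueryable<T> AssumeDistinct<T>(
        this IQueryable<T> source,
        IComparer<T> comparer = null)
        => source;
}

// Usage in the spirit of FIG. 27: asserting that population is already
// hash-partitioned by birthday lets the subsequent GroupBy skip a re-partition.
//   var grouped = population.AssumeHashPartition(x => x.birthday)
//                           .GroupBy(x => x.birthday);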
Rather, the specific features allow a user to assert a hash-partition property for a dataset. and acts described above are disclosed as example forms of The method asserts that a source dataset is hash-partitioned 50 implementing the claims. It is intended that the scope of the on the key computed by the key selection function keySelec disclosed subject matter be defined by the claims appended tor. The hash code for a particular key k is computed using hereto. comparer. GetHashCode(k). If comparer is null, What is claimed is: EqualityComparer-TKey). Default is used. By invoking the 1. A machine implemented method for distributed parallel method AssumeHashPartition on a dataset, the user can 55 processing, comprising: instruct the execution engine as to the nature of the dataset, accessing an expression from a sequential program that is namely, that it has been hash-partitioned according to the executing at a first machine, the expression invoking at provided parameters. least one extension for distributed parallel processing by Consider the exemplary query 702 in FIG. 27. Query 707 a distributed execution engine, wherein the distributed groups people in the source dataset population according to 60 execution engine includes a compute cluster having a their birthdays. Before invoking the group by operator to plurality of nodes, the at least one extension is an opera divide the people in the population dataset, the execution tor specifying a user-defined function using data parallel engine will perform some type of hash partition on the birth processing, and wherein the user-defined function is in a day field for the dataset records. If the dataset population is high-level programming language and receives a first already hash partitioned using this field, an additional hash 65 dataset; partitioning is not necessary. Accordingly, the user can automatically generating an execution plan for parallel specify the partitioning of the dataset by invoking the processing of the expression by the distributed execution US 8,209,664 B2 29 30 engine using the at least one extension, wherein the if the first dataset does not include a partitioning property, automatically generating the execution plan includes: specifying a partitioning of the first dataset in the execu specifying a partitioning of the first dataset at a set of tion plan and dividing the first dataset into individual nodes from said plurality of nodes; partitions at nodes of the compute cluster according to determining whether the user-defined function includes the specified partitioning. an annotation that the user-defined function is oper 8. A machine implemented method according to claim 1, able on partitions of the first dataset; wherein: if the user-defined function includes the annotation, the first dataset is of a type having a first field; automatically generating code for executing the user automatically generating the execution plan includes defined function in parallel at the set of nodes; and 10 determining whether the first field includes an annota if the user-defined function does not include the anno tion that the first field is not nullable; and tation, automatically generating code for streaming if the first field includes the annotation that the first field is partitions of the first dataset to a single node and code not nullable, optimizing the execution plan based on for executing the user-defined function at the single 15 presence of the first field not being nullable. node; and 9. 
A machine implemented method according to claim 1, providing the execution plan to the distributed execution wherein: engine for controlling parallel execution of the expres the first dataset is of a type having a first field; S1O. automatically generating the execution plan includes 2. A machine implemented method according to claim 1, determining whether the expression includes an annota wherein: tion mapping the first field to a second field of a second the distributed execution engine controls parallel execution dataset; and of the expression in the compute cluster; if the first field is mapped to a second field of a second automatically generating the execution plan includes auto dataset, generating the execution plan to access an out matically generating an execution plan graph having 25 put of a second expression referencing the second Vertices, automatically generating code based on the at dataset to determine a result of evaluating the first least one extension and assigning the code to particular expression. Vertices in the execution plan graph, the automatically 10. A distributed parallel processing system, comprising: generating the execution plan being performed by a dis a compute cluster including a plurality of nodes, each node tributed execution provider; and 30 including at least one processor, the machine implemented method further comprises dis an execution provider that accesses expressions from a tributing the code to nodes in the compute cluster corre sequential program that is running at a first machine, the sponding to the particular vertices in the execution plan execution provider receives at least one expression graph. specifying an extension for distributed data parallel pro 3. A machine implemented method according to claim 2, 35 cessing in the compute cluster, the execution provider wherein: automatically generates an execution plan graph and the expression includes a language integrated query. code for vertices in the execution plan graph using the 4. A machine implemented method according to claim 1, extension for distributed data parallel processing, further comprising: wherein the extension is an operator specifying division if the user-defined function includes the annotation, out 40 of a first dataset into multiple datastreams at the compute putting the code for executing the user-defined function cluster, the code for at least one of the vertices receives in parallel to the set of nodes; and the first dataset and provides a second dataset having at if the user-defined function does not include the annota least two datastreams included therein, the two datas tion, outputting the code for streaming partitions of the treams including records from the first dataset; and first dataset to the set of nodes and outputting the code 45 an execution engine that receives the execution plan graph for executing the user-defined function to the single and the code from the execution provider and manages node. parallel execution of the expression in the compute clus 5. A machine implemented method according to claim 4. ter based on the execution plan graph and the code, wherein the single node is part of the set of nodes. wherein the execution engine distributes the code for the 6. A machine implemented method according to claim 1, 50 at least one of the vertices receiving the first dataset to wherein the automatically generating an execution plan one or more nodes in the compute cluster corresponding includes: to the at least one of the vertices. 
determining whether the user-defined function includes an 11. A distributed parallel processing system according to annotation specifying the function is not stateful; claim 10, wherein: if the user-defined function includes the annotation speci 55 the operator is a first operator, fying the function is not stateful, automatically generat the execution provider accesses a first expression and a ing a first version of the execution plan; and second expression from the program running at the first if the user-defined function does not include the annotation machine, the first expression generating a first output specifying the function as not stateful, automatically and the second expression generating a second output, generating a second version of the execution plan. 60 the first expression invoking a second operator specify 7. A machine implemented method according to claim 1, ing deferred execution of the first expression in the com further comprising: pute cluster and the second expression invoking the sec determining whether the first dataset includes a partition ond operator specifying deferred execution of the ing property; second expression in the compute cluster, if the first dataset includes a partitioning property, optimiz 65 the execution provider accesses a third expression from the ing the execution plan Such that the first dataset is not program invoking a third operator calling for generation re-partitioned in the compute cluster, and of the first output and the second output; US 8,209,664 B2 31 32 the execution provider generates the execution plan graph ating a second execution graph with the first dataset and the code for vertices of the execution plan graph for partitioned according to a different data partitioning, a single job including the first expression, the second determining whether the user-defined function includes expression and the third expression; and an annotation specifying the user-defined function as the execution engine generates the first output and the associative, second output from the compute cluster as the single job. if the user-defined function includes the annotation, gen 12. A distributed parallel processing system according to erating code to apply the aggregate function on the claim 10, wherein: individual partitions of the first dataset and code to the execution provider accesses a first expression from the combine the results of applying the aggregate func program running at the first machine, the first expression 10 tion on the individual partitions, and invoking a method specifying that the first dataset has if the user-defined function does not include the anno been sorted; and tation, generating code to stream the individual parti the execution provider optimizes the execution plan graph tions of the first dataset to a single node for application based on the specified sorting of the first dataset. of the aggregate function; and 13. A distributed parallel processing system according to 15 providing the execution plan to an execution engine that claim 10, wherein: controls parallel execution of the one or more expres the execution provider accesses a first expression from the sions in the compute cluster. program running at the first machine, the first expression 15. 
One or more processor readable storage devices invoking a method specifying that each record of the first according to claim 14, wherein: dataset is distinct; and determining whether the one or more expressions include the execution provider optimizes the execution plan graph an extension specifying a particular data partitioning based on each record of the first dataset being distinct. includes determining whether the one or more expres 14. One or more processor readable storage devices having sions include an extension specifying either hash-parti processor readable code stored thereon, the processor read tioning or range-partitioning. able code programs one or more processors to perform a 25 16. One or more processor readable storage devices method comprising: according to claim 14, wherein the method further comprises: receiving one or more expressions from an application dividing the first dataset into a plurality of partitions executing at a first machine, the one or more expressions according to the particular data partitioning or the dif include an aggregate function that invokes a user-de ferent data partitioning; providing each partition of the fined function referencing a first dataset; 30 plurality to one node in the compute cluster, and automatically generating an execution plan for executing executing the one or more expressions in parallel at the the one or more expressions in parallel at nodes of a nodes of the compute cluster. compute cluster, the automatically generating includ 17. One or more processor readable storage devices ing: according to claim 14, wherein automatically generating an determining whether the one or more expressions 35 execution plan further includes: include an extension specifying a particular data par determining whether the one or more expressions include a titioning for the first dataset, property indicating that the first dataset is already parti if the one or more expressions include the extension tioned; and specifying a particular data partitioning, generating a if the one or more expressions include the property, gener first execution graph with the first dataset partitioned 40 ating the execution plan to distribute the first dataset in according to the particular data partitioning, the compute cluster according to the partitioning indi if the one or more expressions do not include the exten cated by the property. sion specifying a particular data partitioning, gener k k k k k