!
):8$+B3@9$-:>:>8$'$ *B3=6F?B;C$3>5$2649>:AE6C
167:JG 'G>9=> 6C9, #@97>47DF )7B:G *9G;8A
#3,!-12, ><1@N#6GI= 5DG@H=DE 1DEGDC,Z&JC<6GN[Q,$:7GJ6GN bcYbdQ,b`ag
&;@5B-@5;:((G( )1?1->/4 @;<5/( -JG,G:H:6G8=,I:6B,R, "3F3N4.7DH;57E43@64&@F7DAB7D34;>;FK 2LD B6>C 9>G:8I>DCH ;DG >< "6I6 R d 3J]J%;JWJPNVNW] d ?^N[b%XY]RVR\J]RXW JYY[XJLQN\ d 8W]NP[J]NM%\b\]NV\ d 7bK[RM%\]X[JPN%\b\]NV\ d 8WMNaRWP%]NLQWRZ^N%OX[%KRXUXPRLJU%MJ]J
d >J]]N[W%VRWRWP%f ;JLQRWN%UNJ[WRWP ! d @^UN%VRWRWP d 6[JYQ%VRWRWP% d 2UX^M%LXVY^]RWP%/%>J[]R]RXWRWP%JYY[XJLQ d 4_XU_RWP%P[JYQ\ d >[NON[NWLN\%RW%MJ]J%VRWRWP d ;R\\RWP%MJ]J%RW%AYJ]RJU%3J]J`J[NQX^\N d @NLXVVNWMN[%\b\]NV\ &;@5B-@5;:(G( &A8@505?/5<85:->5@E 2=:B: "1',,YYY HDB: .GD?:8IH R,
(34#J &)A4. \ [YZ[T[Y[[ d \'CCDK6I>K: +D7>A>INR,1B6GI,6C9,1JHI6>C67A: 1DAJI>DCH\,2G68@ ]b &@H7EF;EE7?7@F46R3H7@;D4 D7761:73F [YZZT[Y[Y d 5=:6IQ, >DADDC,D;,B6HH>K:,H8>:CI>;>8 96I6,;GDB 6HIGDCDBN D7H:GK6I>DCH d #3,!-12,.GD?:8I, ><1@N#6GI= b`aeYb`ah $D7@5:T D3L;> !*-.T&*-&KT$0*!K,4,DA<75F4(./%444[YZ^T[YZ_ d 3!',X,314/,,Y 3C>K:GH>I69: $:9:G6A 9D,!:6G6,Z$DGI6A:O6[ d !$/('2"%$+' #*,'-.'.'-&'-1 /$.)0 &;@5B-@5;:(G( &-?@;0;:? <>;61/@ (1@-?7E &;@5B-@5;:(G( 0-@-(?@>A/@A>1 %G6E=H,6G:,K:GN,JH:;JA,;DG,BD9:A>C<,K6G>:IN,D;,:CI>I>:H,6C9,I=:>G,G:A6I>DCH
273EF4BDAF7;@4;@F7D35F;A@4 @7FIAD= Internet
!:7?;53>45A?BAG@6E !AT3GF:AD4@7FIAD= &;@5B-@5;:(G( %->31($>-<4?(5:(@41(C;>80(F
d Social graphs : Facebook, Twitter, Google+, LinkedIn, etc… Facebook [2012] : a billion users (nodes) and more than 140 billion friendship relationships (edges). • Endorsement graphs : web link graph, paper citation graph, etc… #web pages = 30 billions or more, and #devices = probably more than a billion. d Location graphs : map, power grid, phone network, etc… d Co-occurrence graphs : term-document bipartite, click-through bipartite, etc… MEDLINE adds from 1 to 140 publications per day &;@5B-@5;:(G( ",(!'*+((>;61/@( 53*7E"->@4 ! (>;?<1/@5B1 I:7D7 3D74%D3B:E4;@4#3DF: AD4.=K M Y !S,18=L6GOUH 26A@,DC,#6GI= D7H:GK6I>DC,R Y 'B6<:,6C6ANH>H >C,I=:,;:6IJG: 9DB6>C Y #MIG68I ;:6IJG:HQ,86I:
Y "S,4>C@DK>8U26A@ DC,V $DG:HI,"6I6,EGD?:8I W Y %G6E=,BD9:A>C< R,6,IG:: >H CD9:Q,6C9,G:A6I>DCH=>E 86C 7: C:><=7DG=DD9Q,T ! Y .S, 6JB6CCUH 26A@,DC,"6I68J7:H R,06H96B6C ZBJAI>EA:,6GG6N[ Y P,1E6GH>INQ,:BEIN 8:AAH >C,BJAI>9>B:CH>DCC6A>IN &;@5B-@5;:(G( $>-<4?(->1(1B1>EC41>1 F 5: C::9 >< %G6E=,'C6ANI>8H 2=>H,I6A@,>H R Y 6,HJGK:ND;, ,$*- ;G6B:LDG@H 6C9,I:8=C>FJ:H Y EGD767AN CDI,:M=6JHI>K: Y CDI 6,9::EX:ME:G>B:CI6A 8DBE6G>HDC 7:IL::C IDDAH
1DJG8:,R,=IIERXX9B
&'&' #>1=A1:@(?A.3>-<4(05?/;B1>E( d $>C9>C<,HJ7
#J3?B>7O
1 2 2 1 1 ' ) 2 % % ) ) 4 1 % + 1 1 $ 2 $ + $ +
%D3B:463F343E7 2 1 2 2 1 2 2 1 ) 1 1 ) ) 1 ) ) QP ) 1 1 1 1 1 1 $ $ $ $ [email protected]:E
U?;@;?G?4EGBBADF4`4\V ac 'A@85:1 d $>-<4?(-:0(3>-<4(95:5:3 d 53 0-@-(2>-91C;>7?H-:-8E@5/? d 53 3>-<4(2>-91C;>7?H-:-8E@5/? d *;91 /;:@>5.A@5;:? d !;:/8A?5;: Big Data Framework MapReduce : d2004 : Google dA programming model and not an algorithm dExecuted in a distributed environment dSeveral implementations such as Hadoop dBased on two functions (Map and Reduce) dBased on key-value format
&* Big Data Frameworks MapReduce :
Example : mothergooseclub.com d Peter Piper picked a peck of pickled pepers d A peck of pickled pepers Peter Piper picked d If Peter Piper picked a peck of pickled pepers d Where’s the peck of pickled pepers Peter Piper Picked ?
&+ Big Data Frameworks MapReduce / Hadoop : Google 2004 Strength : General-purpose framework, powerful and simple, open-source Limitation : General-purpose framework ! e.g. : Less efficient on relational data
Some solutions : Problem and data-specific d Pig Latin : Yahoo (2008) – Iterative data processing d Hive : Facebook (2009) – SQL-like workload d Pregel : Google (2010) – Graph processing. !. GraphLab (2011), PowerGraph (2012) ! Giraph : Facebook (2012) --- Iterative graph processing (Pregel open source) ! Trinity : Microsoft (2013) --- started 2010 d SpatialHadoop : U. Minnesota (2013) – Geospatial data processing
0U]N[WJ]R_N\%/ d Dryad : Microsoft (2007) – https://www.microsoft.com/en-us/research/project/dryad/ started in dec. 2004 d Quite expressive d Subsumes other computation frameworks, such as Google’s map-reduce, or the relational algebra d Commercial (not open-source) d … &, Big Data Analytics more …. d 1@3B; NGJQJ[RJ !$#! #!%CA4<8F%'%&% Q]]Y\/$$\YJ[T#JYJLQN#X[P$ d J%YX`N[O^U PNWN[JU Y^[YX\N Y[XLN\\RWP O[JVN`X[T ]QJ] Y[X_RMN\ JW%NJ\N XO%^\N%]XXU OX[% 677:4:6>D$3>3
&. Big Data ML-DM
AYJ[T Big Data ML-DM
PARMA [Riondato etal , CIKM 2012]
d Tool for mining frequent itemsets and association rules d Run in a distributed mode d Based on MapReduce
'& Big Data ML-DM
SystemML [Ghoting etal , ICDE 2011] d Distributed framework d Provides Machine learning solutions d Run on top of Hadoop d Based on a Declarative Machine Learning Language d High-Level Operator Component d Low-Level Operator Component
0%_N[\RXW%XW%AYJ[T HD:31%'%&+I '' Big Data Analytics
Mahout http://mahout.apache.org/ dSet of scalable machine learning algorithms dBased on Hadoop MapReduce dExamples:
!Recommender systems !Distributed Principal Components Analysis (PCA) techniques for dimensionality reduction !Clustering algorithms
'( 'A@85:1 d $>-<4?(-:0(3>-<4(95:5:3 d 53 0-@-(2>-91C;>7?H-:-8E@5/? d 53 $>-<4(2>-91C;>7?H-:-8E@5/? d *;91 /;:@>5.A@5;:? d !;:/8A?5;: Graph Processing Frameworks
JU\X /%B[RWR]b%N;RL[X\XO] !%6R[JYQ N5JLNKXXT !%e Graph Processing Frameworks Pregel dScalable and fault tolerant dVertex-oriented computing dFlexible API to use in various graphs processing problems dCompatible with the Hadoop ecosystem and master/slaves architecture. dBased on the communications and the exchange of messages between the vertices.
'+ Graph Processing Frameworks Running a Pregel program
1. Send the graph to master node. 2. Load the graph in the master (in GFS or BigTable). 3. Assign sub-graph to workers candidates. 4. Each worker follows the execution of their partitions (communication, exchange of messages, ..) 5. The master follows the states of the workers to get the final results. 6. Send the final results to user
', Graph Processing Frameworks
GraphX dDistributed and In-memory processing framework dSub project of Spark dCompatible with many systems ( HDFS , Apache Spark ) dAllows persistence to the RAM dUses the @N\RURNW]%3R\][RK^]NM%3J]J\N]\%N RDD) concept dUses the Data Frame in the Spark project
'- Graph Processing Frameworks GraphX
>J[]R]RXW%& >J[]R]RXW%& DN[]RLN\ 4MPN\ 1 3 1 F1 1 2 F1 Worker 1 1 3 2 F2 1 3 F2 2 3 F3 2 3 F3 4 2 5 2 4 F4 4 5 F1 5 4 5 F5 4 2 F2 Worker 2 >J[]R]RXW%' 2 F2 5 2 F3 >J[]R]RXW%' DN[]RLN\ 4MPN\
'.