
Big Graph Mining: Frameworks and Techniques

Sabeur Aridhi and Engelbert Mephu Nguifo

EU COST BigSkyEarth Workshop, Sopron (Hungary), February 23-24, 2017

Motivation – Research topic

Our research team: Data, Services and Interoperability

Two main directions for Big Data:
• Data Management
  - Query optimisation approaches
  - Integrated systems
  - Hybrid storage systems
  - Indexing technique for biological data
• Pattern mining & Machine learning
  - Rule mining
  - Graph mining
  - Cloud computing: partitioning approach
  - Evolving graphs
  - Preferences in data mining
  - Missing data in spatial data warehouses
  - Recommender systems

Motivation – Multidisciplinarity

Theme DSI: some projects

• LabEx IMobS3 2012-2022
  - "Innovative Mobility: Smart and Sustainable Solutions", Track 2
• Investissement d'avenir BreedWheat 2011-2020
  - Wheat, Biology, Bioinformatics, Genetics, Genomics, Ecophysiology, high throughput phenotyping and genotyping
• EU COST Project BigSkyEarth 2015-2018
  - massive scientific data from astronomy observations
• French-Brazil CNRS-INRIA-FUNCAP Project 2015-2016
  - Universidade Federal do Ceará (Fortaleza)

Motivation – Mastodons project Petasky

Motivation – data structure

Graphs are very useful for modeling a variety of entities and their relations.

[Figures: yeast protein interaction network; Internet; chemical compounds; co-author network]

Motivation – Large Graphs in the world …

• Social graphs: Facebook, Twitter, Google+, LinkedIn, etc.
  - Facebook [2012]: a billion users (nodes) and more than 140 billion friendship relationships (edges).
• Endorsement graphs: web link graph, paper citation graph, etc.
  - #web pages = 30 billion or more, and #devices = probably more than a billion.
• Location graphs: map, power grid, phone network, etc.
• Co-occurrence graphs: term-document bipartite, click-through bipartite, etc.
  - MEDLINE adds from 1 to 140 publications per day

Motivation – EU COST Project BigSkyEarth: Prospective

Where are Graphs in Earth or Sky?
- C. Schwarz's talk on Earth observation:
  - Image analysis in the feature domain
  - Extract features, categories, and objects (features extractor)
  - Image analysis in the stacked data domain (linked data)
  - Image analysis by visualization
  - High-level descriptors to identify relationships and unexpected events
- D. Vinkovic's talk on the "Forest Data project":
  - Graph modeling: a tree is a node, and a relationship can be neighborhood, …
- P. Baumann's talk on Datacubes: Rasdaman (multiple array)
  - Sparsity, empty cells in multidimensionality

Motivation – Graphs are everywhere …

We need Big Graph Analytics

This talk is:
- a survey of big graph mining frameworks and techniques
- probably not exhaustive
- not a deep/experimental comparison between tools

Outline
• Graphs and graph mining
• Big Data analytics
• Big Graph frameworks
• Some contributions
• Conclusion

Graph processing/mining

Frequent subgraph discovery
• Finding subgraphs having a minimum support

Example:
[Figure: a graph database and its frequent subgraphs]

(minimum support = 3)

Framework: MapReduce
• 2004: Google
• A programming model, not an algorithm
• Executed in a distributed environment
• Several implementations, such as Hadoop
• Based on two functions (Map and Reduce)
• Based on a key-value format

Big Data Frameworks

MapReduce:

Example (mothergooseclub.com):
• Peter Piper picked a peck of pickled peppers
• A peck of pickled peppers Peter Piper picked
• If Peter Piper picked a peck of pickled peppers
• Where's the peck of pickled peppers Peter Piper picked?
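As an illustration only, the tongue-twister above can be run through a word-count job, the usual first MapReduce example. The slides do not say which job they applied to these lines, so word count is an assumption here. The sketch below runs in a single process; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and do not belong to Hadoop or any other implementation.

```python
import re
from collections import defaultdict

LINES = [
    "Peter Piper picked a peck of pickled peppers",
    "A peck of pickled peppers Peter Piper picked",
    "If Peter Piper picked a peck of pickled peppers",
    "Where's the peck of pickled peppers Peter Piper picked?",
]


def map_phase(line):
    """Map: emit one (word, 1) key-value pair per word of the input line."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Chain map, shuffle and reduce over a list of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(LINES)
print(counts["peter"], counts["a"], counts["the"])
```

In a real deployment the map and reduce calls would run on different machines and the shuffle would move data over the network; only the key-value contract between the two functions is shown here.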

MapReduce / Hadoop: Google 2004
Strength: general-purpose framework, powerful and simple, open-source
Limitation: general-purpose framework, e.g. less efficient on relational data

Some solutions: problem- and data-specific
• Pig Latin: Yahoo (2008) – Iterative data processing
• Hive: Facebook (2009) – SQL-like workload
• Pregel: Google (2010) – Graph processing
  - GraphLab (2011), PowerGraph (2012)
  - Giraph: Facebook (2012) – Iterative graph processing (Pregel open source)
  - Trinity: Microsoft (2013) – started 2010
• SpatialHadoop: U. Minnesota (2013) – Geospatial data processing

Alternatives:
• Dryad: Microsoft (2007) – https://www.microsoft.com/en-us/research/project/dryad/ (started in Dec. 2004)
  - Quite expressive
  - Subsumes other computation frameworks, such as Google's map-reduce, or the relational algebra
  - Commercial (not open-source)
• …

Big Data Analytics

more …
• Spark (Zaharia et al., USENIX 2010) https://spark.apache.org/
  - a powerful general-purpose processing framework that provides an easy-to-use tool for efficient analytics
• Flink https://flink.apache.org/
  - Stratosphere (Alexandrov et al., VLDB J. 2014), funded by the German Research Foundation (DFG)
  - an open source framework for processing data in both real-time mode and batch mode
• Storm (Toshniwal et al., SIGMOD 2014) http://storm.apache.org/ (Twitter)
  - an open source framework for processing large structured and unstructured data in real time
• H2O (Alagiannis et al., SIGMOD 2014)
  - a system that brings database-like interactiveness to Hadoop
  - develops an analytical interface for cloud computing, providing users with tools for data analysis

Big Data Analytics: Comparisons
• https://fr.slideshare.net/sbaltagi/flink-vs-spark
• http://www.metistream.com/comparing-hadoop-mapreduce-spark-flink-storm/

Big Data ML-DM

Spark

PARMA [Riondato et al., CIKM 2012]
• Tool for mining frequent itemsets and association rules
• Runs in distributed mode
• Based on MapReduce


SystemML [Ghoting et al., ICDE 2011]
• Distributed framework
• Provides solutions
• Runs on top of Hadoop
• Based on a Declarative Machine Learning Language
• High-Level Operator Component
• Low-Level Operator Component
• A version on Spark [VLDB 2016]

Big Data Analytics

Mahout http://mahout.apache.org/
• Set of scalable machine learning algorithms
• Based on Hadoop MapReduce
• Examples:
  - Recommender systems
  - Distributed Principal Components Analysis (PCA) techniques for dimensionality reduction
  - Clustering algorithms

Graph Processing Frameworks

also: Trinity (Microsoft), Giraph (Facebook), …

Pregel
• Scalable and fault tolerant
• Vertex-oriented computing
• Flexible API usable for various graph processing problems
• Compatible with the Hadoop ecosystem and a master/slaves architecture
• Based on communication through the exchange of messages between vertices

Running a Pregel program

1. Send the graph to the master node.
2. Load the graph in the master (e.g. in GFS).
3. Assign sub-graphs to candidate workers.
4. Each worker executes its partitions (communication, exchange of messages, ...).
5. The master monitors the states of the workers to get the final results.
6. Send the final results to the user.
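The vertex-oriented model behind these steps can be sketched in a few lines. The code below is a single-process illustration, not Pregel's actual API: it propagates the maximum vertex value through the graph, a common textbook example for this model. The function name `pregel_max` and the toy graph are assumptions made for the sketch, and the master/worker distribution of steps 1 to 6 is deliberately left out.

```python
from collections import defaultdict


def pregel_max(graph, initial_values):
    """Propagate the maximum value to every vertex, superstep by superstep.

    graph: dict mapping each vertex to the list of its out-neighbours
           (every neighbour must also appear as a key).
    initial_values: dict mapping each vertex to its starting value.
    Returns the final values and the number of supersteps executed.
    """
    values = dict(initial_values)
    active = set(graph)          # every vertex is active in superstep 0
    inbox = defaultdict(list)    # messages delivered at the start of a superstep
    superstep = 0
    while active:
        outbox = defaultdict(list)
        for vertex in active:
            if superstep == 0:
                changed = True   # first superstep: everyone announces its value
            else:
                best = max(inbox[vertex])
                changed = best > values[vertex]
                if changed:
                    values[vertex] = best
            if changed:
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(values[vertex])
            # a vertex that did not change sends nothing: it votes to halt
        inbox = outbox
        active = set(outbox)     # only vertices with incoming messages wake up
        superstep += 1
    return values, superstep


# Hypothetical toy graph: an undirected chain a - b - c - d.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, supersteps = pregel_max(graph, {"a": 3, "b": 6, "c": 2, "d": 1})
print(final_values, supersteps)
```

The computation stops once no messages are in flight, which mirrors the halting condition of the model: all vertices have voted to halt and no vertex is reactivated.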


GraphX
• Distributed, in-memory processing framework
• Sub-project of Spark
• Compatible with many storage systems (e.g. HDFS)
• Allows data to be persisted in RAM
• Uses the Resilient Distributed Datasets (RDD) concept
• Uses the DataFrame of the Spark project

GraphX
[Figure: a graph stored as a vertex table and an edge table, split into Partition 1 (Worker 1) and Partition 2 (Worker 2)]
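The idea of storing a graph as a vertex collection plus an edge collection, with edges split across workers, can be sketched as follows. This is plain Python, not the GraphX API (GraphX itself is a Scala library on top of Spark). The toy data, the partitioner and all function names are hypothetical and chosen only for illustration.

```python
# Hypothetical toy property graph (not taken from the slides).
vertices = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}   # vertex id -> attribute
edges = [(1, 2, "x"), (1, 3, "y"), (2, 3, "z"),
         (4, 5, "x"), (4, 2, "y"), (5, 2, "z")]        # (src, dst, attribute)


def by_source_block(src, dst):
    """Assumed partitioner: edges whose source is 1..3 go to partition 0, others to 1."""
    return 0 if src <= 3 else 1


def partition_edges(edge_list, num_partitions, partitioner):
    """Split the edge collection into partitions; each edge lives in exactly one."""
    parts = [[] for _ in range(num_partitions)]
    for src, dst, attr in edge_list:
        parts[partitioner(src, dst) % num_partitions].append((src, dst, attr))
    return parts


def routing_table(edge_partitions):
    """For each edge partition, the vertex ids whose attributes it needs locally."""
    return [{v for src, dst, _ in part for v in (src, dst)}
            for part in edge_partitions]


def triplets(edge_partition, vertex_table):
    """Join one edge partition with the vertex attributes it references."""
    return [((src, vertex_table[src]), (dst, vertex_table[dst]), attr)
            for src, dst, attr in edge_partition]


edge_parts = partition_edges(edges, 2, by_source_block)
routes = routing_table(edge_parts)
print(routes)
print(triplets(edge_parts[1], vertices))
```

With this partitioner, vertex 2 is referenced by both partitions, so its attribute has to be available on both workers. That replication of boundary vertices is the cost of cutting the graph along vertices instead of edges.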
