PS030112 withAnalytics theSystems HPCC Platform Lead of the the Lead Systemsof HPCC Initiative Presenter University of Florida : Dr. Flavio Villanustre

hpccsystems.com

Real Life Graph Analytics http://hpccsystems.com

Scenario This view of carrier data shows seven known fraud claims and an additional linked claim.

The Insurance company data only finds a connection between two of the seven claims, and only identified one other claim as being weakly connected.

WHT/082311

2 Real Life Graph Analytics http://hpccsystems.com

Task

After adding the LexID to the carrier Data, LexisNexis HPCC Family 1 technology then explored 2 additional degrees of relative separation

Result

The results showed two family groups interconnected on all of these seven claims. Family 2

The links were much stronger than the carrier data previously supported.

WHT/082311

3 Property Transaction Risk http://hpccsystems.com

Three core transaction variables measured

• Velocity

• Profit (or not) Flipping • Buyer to Seller Relationship Distance (Potential of Collusion)

Collusion

Profit

WHT/082311

4 Overview Property Transaction Risk http://hpccsystems.com

Derived Public Data ±700 mill Deeds Relationships from +/- 50 terabyte database

Data Factory Clean

Collusion Graph Analytics

Chronological Analysis of all property Sales Large Scale Suspicious Cluster Ranking

Historical Property Sales Person / Network Level Indicators and Counts Indicators and Counts

WHT/082311

5 Suspicious Equity Stripping Cluster http://hpccsystems.com

WHT/082311

6 Results http://hpccsystems.com

Large scale measurement of influencers strategically placed to potentially direct suspicious transactions.

• All BIG DATA on one supercomputer measuring over a decade of property transfers nationwide.

• BIG DATA Products to turn other BIG DATA into compelling intelligence.

• Large Scale Graph Analytics allow for identifying known unknowns.

• Florida Proof of Concept – Highest ranked influencers . Identified known ringleaders in flipping and equity stripping schemes. . Typically not connected directly to suspicious transactions. – Known ringleaders not the Highest Ranking.

• Clusters with high levels of potential collusion. • Clusters offloading property, generating defaults. • Agile Framework able to keep step with emerging schemes in real estate.

WHT/082311

7 Social Graph and Prescriptions http://hpccsystems.com

Scenario Healthcare insurers need better analytics to identify drug seeking behavior and schemes that recruit members to use their membership fraudulently. Groups of people collude to source schedule drugs through multiple members to avoid being detected by rules based systems.

Providers recruit members to provide and escalate services that are not rendered.

Task Given a large set of prescriptions. Calculate normal social distributions of each brand and detect where there is an unusual socialization of prescriptions and services.

Result The analysis detected social groups that are sourcing Vicodin and other schedule drugs. Identifies prescribers and pharmacies involved to help the insurer focus investigations and intervene strategically to mitigate risk.

WHT/082311

8 The Data/Information flow http://hpccsystems.com

Cyber Security

Financial Services

Government

Health Care Big Insurance Data Online Reservations Retail

Telecommunications

Transportation & Logistics

Weblog Analysis

INDUSTRY SOLUTIONS

Customer Data Integration Data Fusion Fraud Detection and Prevention Know Your Customer Master Data Management Weblog Analysis

. High Performance Computing Cluster Platform (HPCC) enables data integration on a scale not previously available and real-time answers to millions of users. Built for big data and proven for 10 years with enterprise customers. . Offers a single architecture, two data platforms (query and refinery) and a consistent data-intensive programming language (ECL) . ECL Parallel Programming Language optimized for business differentiating data intensive applications

WHT/082311

9 The Open Source HPCC Systems platform http://hpccsystems.com

• Open Source distributed data-intensive computing platform • Shared-nothing architecture • Runs on commodity computing/storage nodes • Binary packages available for the most common distributions • Provides end-to-end Big Data workflow management , scheduler, integration tools, etc. • Originally designed circa 1999 (predates the original paper on MapReduce from Dec. ‘04) • Improved over a decade of real-world Big Data analytics • In use across critical production environments throughout LexisNexis for more than 10 years

WHT/082311

10 Components http://hpccsystems.com

• The HPCC Systems platform includes: • Thor: batch oriented data manipulation, linking and analytics engine • Roxie: real-time data delivery and analytics engine • A high level declarative data oriented language: ECL • Implicitly parallel • No side effects • Code/data encapsulation • Extensible • Highly optimized • Builds graphical execution plans • Compiles into C++ and native machine code • Common to Thor and Roxie • An extensive library of ECL modules, including data profiling, linking and

WHT/082311

11 The Three HPCC components http://hpccsystems.com

1 HPCC Data Refinery (Thor) • Massively Parallel Extract Transform and Load (ETL) engine • Enables data integration on a scale not previously available • Programmable using ECL

2 HPCC Data Delivery Engine (Roxie) • A massively parallel, high throughput, query engine • Low latency, highly concurrent and highly available • Allows multipart (compound) indices for efficient retrieval • Programmable using ECL

3 Enterprise Control Language (ECL) • An easy to use, declarative data-centric programming language optimized for large-scale data management and query processing • Highly efficient; automatically distributes workload across all nodes. • Automatic parallelization and synchronization

Conclusion: End to End platform • No need for any third party tools

WHT/082311

12 The HPCC Systems platform http://hpccsystems.com

WHT/082311

13 Detailed HPCC Architecture http://hpccsystems.com

WHT/082311

14

Enterprise Control Language (ECL) http://hpccsystems.com

Declarative programming language: Describe what needs to be done and not how to do it Powerful: High level data primitives as JOIN, TRANSFORM, PROJECT, SORT, DISTRIBUTE, MAP, etc. are available. Extensible: As new attributes are defined, they become primitives that other programmers can use Implicitly parallel: Parallelism is built into the underlying platform. The programmer needs not be concerned with it Maintainable: A high level programming language, without side effects and with efficient encapsulation, programs are more succinct, reliable and easier to troubleshoot Complete: ECL provides for a complete data programming paradigm Homogeneous: One language to express data algorithms across the entire HPCC platform, data integration, analytics and high speed delivery

WHT/082311

15

Enterprise Control Language (ECL) http://hpccsystems.com

• The ECL optimizer generates an optimal execution plan • Subgraphs run activities in parallel, automatically persist the results and spill to disk as needed • Lazy evaluation avoids re-computing results when data and code haven’t changed • Graphs are dynamic and display the number of records traversing the edges and data skew percentages

WHT/082311

16 Machine Learning on HPCC http://hpccsystems.com

. Extensible Machine Learning Library developed in ECL

. Fully distributed across the cluster

. Supports PB-BLAS (Parallel Block BLAS)

. General statistical functions

. Supports supervised, semi-supervised and unsupervised learning methods

. Document manipulation, tokenization and statistical Natural Language Processing

. A consistent and standard interface to classification (“pluggable classifiers”)

. Efficient handling of iterative algorithms (for example, for numeric optimization and clustering)

. Open Source and available at: http://hpccsystems.com/ml WHT/082311

17 ECL-ML: extensible ML on HPCC http://hpccsystems.com

General aspects . Based on a distributed ECL linear algebra framework . New algorithms can be quickly developed and implemented . Common interface to classification (pluggable classifiers) ML algorithms . Linear regression . Several Classifiers . Multiple clustering methods . Association analysis Document manipulation and statistical grammar-free NLP . Tokenization . CoLocation Statistics . General statistical methods . Correlation . Cardinality . Ranking

WHT/082311

18 ECL-ML: extensible ML on HPCC (ii) http://hpccsystems.com

Linear Algebra library . Support for sparse matrices . Standard underlying matrix/vector data structures . Basic operations (addition, products, transpositions) . Determinant/inversions . Factorization/SVD/Cholesky/Rank/UL . PCA . Eigenvectors/Eigenvalues . Interpolation (Lanczos) . Identity . Covariance . KD Trees

WHT/082311

19 Machine Learning on HPCC http://hpccsystems.com

. ML on a general-purpose Big Data platform means effective analytics in-situ

. The combination of Thor and Roxie is ideal when, for example, training a model on massive amounts of labeled historical records (Thor), and providing real-time classification for new unlabeled data (Roxie)

When applying Machine Learning methods to Big Data: data profiling, parsing, cleansing, normalization, standardization and feature extraction represent 85% of the problem!

WHT/082311

20 The Future: Knowledge Engineering http://hpccsystems.com

At the end of the day . Do I really care about the format of the data? . Do I even care about the placement of the data? . I do care (a lot!) about what can be inferred from the data . The context is important as long as it affects my inference process . I want to leverage existing algorithms

ECL KEL Generates C++ (1->100) Generates ECL (1->12) Files and Records Entities and associations Detailed control of data format Loose control of input format; none of processing Can write graph and statistical algorithms Major algorithms built in Thor/Roxie split by human design Thor/Roxie split by system design Solid, reliable and mature R&D WHT/082311

21 KEL by example (WIP!) http://hpccsystems.com

• Actor := ENTITY( FLAT(UID(ActorName),Actor=ActorName) ) • Movie := ENTITY( FLAT(UID(MovieName),Title=MovieName) ) • Appearance := ASSOCIATION( FLAT(Actor Who,Movie What) )

• USE IMDB.File_Actors(FLAT,Actor,Movie,Appearance)

• CoStar := ASSOCIATION( FLAT(Actor Who,Actor WhoElse) )

• GLOBAL: Appearance(#1,#2) Appearance(#3,#2) => CoStar(#1,#3)

• QUERY:FindActors(_Actor) <= Actor(_Actor) • QUERY:FindMovies(_Actor) <= Movie(UID IN Appearance(Who IN Actor(_Actor){UID}){What}) • QUERY:FindCostars(_Actor) <= Actor(UID IN CoStar(Who IN Actor(_Actor){UID}){WhoElse}) • QUERY:FindAll(_Actor) <= Actor(_Actor),Movie(UID IN Appearance(Who IN _1{UID}){What}),Actor(UID IN CoStar(Who IN _1{UID}){WhoElse})

WHT/082311

22 PS030112 Assorted examples • • • • • • • • • • • • • • • • • Python integrationPython Java integration integrationR Data Streaming/Kafka SQL/JDBCintegration ViewSmart Graphtraversal problemhandling (IMDB/Kevin Baconexample) Sentiment Analysis example Attribute manager Scoring/statistical models Full text classifiers Search and retrieval Recommendationsystems Exploratorydata analysis Data Visualization Recordlinkage, entity disambiguation, entity resolution using SALT Data integration Flowusing HPCC

23

FC072911 Data Integration

24

PS030112 Data Integration using HPCCFlow GeneratesECL code so itcan be used also asalearning tool Drag and drop interfaceto data workflows onHPCC

on Design Primitives shown with the ECL Simple workflow,

25

PS030112 Data Integration using HPCCFlow

calling child a job A parent job

26

PS030112 Get the source Getcode:the Data Integration using HPCCFlow https://github.com/hpcc - systems/spoon - plugins

by aparent job Child job called

27

FC072911 Record linkage, entity disambiguation, entity resolution

28

PS030112 Scalable AutomatedTechnologyLinking (SALT) . . . . . Significant productivity boost! clustering Sophisticatedspecificitybased linking and standardization normalizationparsing, cleansing, Providesfor automateddata profiling, QA/QC, Templatebased ECL code generator Linking Technology” The acronym stands for “Scalable Automated

D

a and and t R a e

S c o o u r

d r c

M L e i s a a n C n t k c o d a h m

g T i n e p h g

u P r D

e t W r a a s o t t e h c P i a o i o e

g r P n l s o h d r s f t

e i e s l D p i s

A n a a d g r t a d a t

i I i t n o i o g n n e

P a s l r t

o c S e B s e l o s a P e c r a c s k r h s

i ( n i i E n n g T g g / L ) W C e L 482,410 Lines of C++ Lines 482,410 l e i i n C g a a k h o n n i 3,980 Lines of ECL3,980Lines n t m d s

g A

i

42 Lines of SALTof 42 Lines p R n I t s e a g e s r r c a i i g t s o i

o n o r n d m n s

e n t

N o r m a l i R z a e t D c i o o e n r c d i s

M i o a n t c h S

t a n d a

r d i

L z a i n t i

k o F e n i d l e

D a t a

29

PS030112 1. 2. Thoroughlyemployingstatistical algorithms, INPUT INPUT Rules basedvs. probabilisticrecord linkage . . . . matches SALT sufficiently exploreallcredible matches SALT effectively minimizes false J Fl JohnSmith, Atlanta JohnSmith, Atlanta

avio avio

Villanustre,Atlanta

Villanustre,Atlanta

律商 联讯

®

Record 1 Record 2

Error Error SALTtwothe significant has advantages over rules: RULES

RULES

SALT SALT

“JohnSmith“ andthe city for bothrecordsthe match NO Match occurrence small is andthere is only one presentin “Villanustre” is specific “Villanustre”because the frequency of occurrence large is andthere are manypresent in Smith” is NO MATCH Match MATCH “ Flavio , becausesystem the learnthas that “John , because the system learnthas that , becausethedetermine rules that not “ and“ , becauserules determinethe that

specific becausethe frequency of Javio Atlanta Atlanta ” are” thenot same

30

PS030112 Technology Record Linkage: SALT Another of SALTpurpose is to perform record linking

Scalable Automated Linking

31

FC072911 Data Visualization

32

PS030112 The ExcelThe code Basic Visual retrieveto data: Access a RoxieAccess to display query a dynamic data dashboardfrom Excel:within End End Function Function Application.EnableEvents Call [Turnover]. [Turnover]. Application.EnableEvents

Data Visualization: Excel Integration refreshdata

QueryTable.Connection QueryTable.Refresh ()

= True= False=

= " = URL;http ://10.242.99.99:8002/

WsEcl / xslt /query/ roxie

/ excelqueries

? asofdate =" [ + qrydatecell ].Text

33

PS030112 published Roxie the Selecting in Excel:Query Data Visualization: Excel Integration…

34

PS030112 Data Visualization: Desktop visual reportdesigner Reports canReports savedbe locally published a toor

Pentaho Report Designer BIServer call roxie::call RoxieQuery: ${ from month_of_year quarter, select Adhoc byaRoxie calling query. Reports/Charts cancreatedbe acustom writing by or query org.hpccsystems.jdbcdriver.HPCCDriver Custom DriverClass Name: = DirectPort jdbc:hpcc:ServerAddress CustomURL: Connection Type: Connection designer directory Driver in JDBC the CopyHPCC true:PageSize pQtr foodmart

} SQL Query: SQL

\ lib all_cancers_for_state =8008:EclResultLimit=100:QuerySet= \

jdbc and specify the following values: specify the and following jdbc =100:TraceLevel= ::agg_c_10_sales_fact_1997 where::agg_c_10_sales_fact_1997 quarter = store_sales

Parameter

Generic Generic Database

=99.999.999.9;Cluster= , unit_sales

(GA) ALL:TraceToFile

, store_cost,the_year

\ report =true: thor:LazyLoad default:WsECL -

, 35

PS030112 Data Visualization: Pentaho Report Download: Download: http://reporting.pentaho.com/report_designer.php Selection Parameter

Designer

36

PS030112 response times tocomplexresponse queries. analytical Usersexplore databusiness by drilling into and cross Real Data Visualization: - time graphical Analytical online HPCC on Processing(OLAP) Download: Download: Mondrian usingSaiku UI

http://meteorite.bi/saiku - tabulating information speed with

- of - thought thought 37

PS030112 Eclipse Data Visualization: • • DataBIRT Source Types: Web (RoxieService queries) (forJDBC - based open sourcereportingsystem open based HPCC with AdHoc

queries Roxiequeries or queries)

Types DataSource

Source DataJDBC

BIRT

as DataSourceas WebService

38

PS030112 Data Visualization: Download: BIRT Example Output http://download.eclipse.org/birt/downloads/

39

PS030112 • • • results, the such as: of This “Cell allowsperformFormatting” touser some the “JavaScript”,rather than defaultthe plain text.example See ECLsnippet below.code ECL In Watchthe Tech Preview, users specific columnstocan contain output specify “HTML” or • • • Visualizations to:used can be Data Visualization: Result Cell Formatting

with lat. lat. with and long. etc. Hyperlinks:Embed Cell Add textual formatting, e.g.: Summarize data. Demonstratedata/products Understanddatanew color :

Make acell Red/Yellow viewerthat the to alertcertainrows areimportant.

Embed a link a link toEmbed apreformatteda search,map link ora Google to Google

d; dataset([{ d := d3Chart d3.Chart('cars',:= 'name', 'unused'); end; recordr := import Bundles.d3; import Text(bold text)

varstring dataset(

SampleData.Cars

Cars.CarsRecord

pc__javascript Cars.CarsDataset

;

; ) cars;

, d3Chart.ParallelCoordinates}], r);

“__ a trailing or “__html” column with the Name javascript

40

PS030112 Data Visualization: Result Cell Formatting Output

41

PS030112 Chart Data Visualization: Result Cell Formatting

Basic Basic 42

PS030112 Structures Data Visualization: Result Cell Formatting

Tree 43

PS030112 Data Visualization: Result Cell Formatting –

Graphs

44

FC072911 Exploratory Data Analysis

45

PS030112 randomrecords existing in used to be tools SAS.and analytic R such as abilityto the “Big reduceData”a problemtostatisticallydown a significantsubset of EDA toolsprovideexistingnot willof allthe functionality is benefit tools additionalso an EDAThe tooltoolsdata. toFlow setthatbe on big willadded HPCC allowanalytics ExploratoryData Analysis • • • • • • • • EDA Procedures variables Correlationstatistics betweentwo existing columnsof data createdmore orby one doing math on to add dataOptions easily columns of observation Simple math as opssuch x/y foreach generatorsnumber Random Tabulate procedure and descending ascending Rank Max Deviation, Percentiles, Min, Mean, Median, StandardMode, Frequency

(Essentially (Essentially ProcUnivariate)

• • • • • • EDA Visualization and and Histogram Graph will options Bar, include Line, Pie plotted against X and Y axis the to defineGUI will allow userwhichdata is charts Pie Histograms Plots Line ScatterPlots

The The 46

FC072911 SentimentAnalysis

47

PS030112 and probabilities,and and general inductiveinference related problems. text and document analysis, learning, statisticsunsupervised covering and supervised Matrix and processing(ML) algorithms Learning to assistintelligence; business with SystemsHPCC its ECL and 3. 2. 1. Sentiment Analysis: ECL as shown in in example:the as shown source using statementa import Reference librarythe in ECLyour sourcefile ECLto folder. IDE the Extract contentsthe zip of the http://hpccsystems.com/ml the ML LibraryDownload - ML libraryextensibleML an includes parallel fully set of Machine

- ML OUTPUT(x3); x3 := {3, 1,9}, {3, 2, 4}], c := DATASET([ {1, 1, 1}, {1, 2, 5}, {2, 1,5}, {2, 2, 7}, 4}], 3}, {6, 1, 1}, {6, 2, 4}, {7, 1, 9}, {7, 2, {3, 1,8}, {3, 2, 1}, {4, 1,0}, {4, 2,0}, {5, 1, 9}, {5, 2, x2 := DATASET([ {1, 1,1}, {1, 2,5}, {2, 1, 5}, {2, 2,7}, IMPORT * FROM IMPORT * FROM IMPORT * FROM ML; NumericField

Kmeans

(x2,c); ); ML.Types ML.Cluster NumericField

;

;

);

48

PS030112 sentimenta of othervalueor some ML.Classify 3. 2. 1. three logical steps: Using a classifier in ML involves ECL classification newtodata ordergivetoin it a Apply Classifying. classifier the classifier well fits the how Testing. Getting measuresof classified been externally training datasetof that has from modela the Learning - ML Classification: Naïve Bayes Classification

tacklesproblem:“given the can factsI knowthese I predict object; an about tweet

attributeof is positive or .

.

that object.” Forexample, canI predictwhether OUTPUT(Results); Results:= //Classifying OUTPUT( TestModule // Testing Model := // Learningthe model BayesModule */ ML.Docs.Trans dLexicon dTokens dRaw classification trainingset: /* Use IMPORT ML; negative?

- ML.Docs

collection of positive andnegative sentiment tweets

:=

TestModule :=

BayesModule.LearnD BayesModule.ClassifyD ML.Docs.Tokenize.Split ML.Docs.Tokenize.Lexicon :=

:= ( BayesModule.TestD ML.Docs.Tokenize.ToO

module to pre ML.Classify.NaiveBayes

);

IndependentDS - ( process tweettext andgenerate IndependentDS ( ( IndependentDS ( ML.Docs.Tokenize.Clean IndependentDS ( ( dTokens,dLexicon dTokens ;

, ClassDS ); ,

ClassDS , Model ,

, ClassDS )). ); WordsCounted );

(

dRaw );

));

; ; 49

PS030112 negative sentiment. Sentilyze tweets Twittervia the API real in Enter and receivea search realquery Classification Sentiment Analysis: “

uses Systemsuses HPCC and its ML

- time. - time

Sentilyze - Libraryto classify tweets withpositive or ” Twitter Sentiment OUTPUT( OUTPUT( nbSentiment kcSentiment processTweets rawTweets DATASET('~SENTILYZE:: Tweets := IMPORT sentimentclassifiers: classified usingboth Keyword CountandNaïve Examplecode of dataseta of tweets thatare https://github.com/hpcc Get the source code: http://hpccsystems.com/demos/twitter Try the demo: Sentilyze nbSentiment,NAMED kcSentiment,NAMED

:= :=

:= := := := Sentilyze.PreProcess.ConvertToRaw

:= := Sentilyze.KeywordCount.Classify Sentilyze.NaiveBayes.Classify ; Sentilyze.PreProcess.RemoveAnalysis

TWEETS',Sentilyze.Types.TweetType.CSV

(' (' TwitterSentiment_KeywordCount TwitterSentiment_NaiveBayes - systems/ecl ( processTweets ( - processTweets ml (Tweets); -

sentiment ( rawTweets

Bayes );

')); ); ); 50

'));

)

FC072911 Graph TraversalProblem Handling

51

PS030112 http://hpccsystems.com/download/docs/six Get the example code: problems forcustomerfinding detectionof importantand insight data patterns trends. and DataanalysisHPCCSystems techniques using toECLapplied socialgraph canbe traversal Graph Traversal ProblemHandling

- degrees

aka “ SeparationLinksDegrees of using and GettingInformationUseful Datafrom KevinexampleBacon

52

FC072911 Smart ViewSmart

53

PS030112 resolution challenges. a compelling option for the data scientist or business analyst faced with complex entity interface, along with its leveraging a patented,of statistically requiring fusion for and time mission organizations Smart View™ (SV) generates comprehensive profiles ‘entities’of Smart Smart View™ Entity Resolution &Visualization Linking Linking Iteration Records Individual Source –

from a ‘perfect storm’ big of data (

1

- critical applications). The combination of SV’s intuitive user 2

3

ie , disparate, complex and massive datasets 4

- based clustering technique, make it – formulatingin an

acrossiterations4 of the identity analyzed.of being people, people, places, and recordsweretogether fused depicting how 12 individual how 12 depicting This Smart View This Smartscreenshot containsa visualization

- depthprofile –

ultimately

54

PS030112 relationship visualizations that display dynamically according and to ‘point’ ‘click’commands. scientist in interpreting what their link analysis results signify, Smart View a includes set of statistical process to(similar how SV’s entity resolution problem is solved). To aide the data user to define a series relationships,of then uncover such relationships through an iterative, linkages between them will be uncovered. Smart View’s link analysis feature allows the end The more comprehensive your entity profiles are, the greater the probability that non Smart Smart View™ Link Analysis Visualization & linkages? These are justa sample degrees of separation?degrees of Forthose How manyHow areindividuals linked address(forKevinany or Bacon, separation,are‘strong’ thehow other entityother forthat matter of the many questionsof SV’s link analysis featureusershelp can actorsdegree of with one to Kevin3 through Bacon

- obvious - 

55 )

FC072911 SQL/ JDBC Integration

56

PS030112 your data without the need to write a single datayour writetoa single line need the ofECL! without This allows youto connecttoHPCC Systems the platform through yourfavorite clientJDBC and access queries. DriverJDBC provides HPCCThe SQL SQL/JDBC Integration • • • • • writeECL! Leverage dataHPCC and to client learn need and JDBCfunctionality without Createsentry covers the under HPCC full of the powerHarnesses syntaxCALL orSELECT simple SQL Supports Analyze HPCCdata using manyClient Interfaces JDBCUser

     Opens the door forprogrammersnon ECLdoor the Opens data.HPCC accessto Automatic Index fetching capabilities forquicker data fetches compiled, andexecutedyourtarget on cluster Submittedrequest generatesSQL ECL is which codesubmitted, Stored published Proceduresas DB Accessqueries dataHPCC Access Tablesfiles as DB

- point forprogrammatic data access - based access toSystems access based HPCC data published and files

57

PS030112 Class. public

SQL/JDBC Integration: Example Code

public

} {

forName

class

} catch try Connection con=

static con=

Cluster= {

HpccConnect

(Exception e){ //1. Initialize HPCC ConnectionInitialize //1. (

" } con.close } System. ResultSet pst.clearParameters DriverManager. } Close connection //5. the while Loopthrough //4. all the and records the information write consoleto the the headers the consolePrintto //3. PreparedStatement the Query select Write to //2.

org.hpccsystems.jdbcdriver.HPCCDriver

void

e.printStackTrace System. default:WsECLDirectPort

( main(String[] rs.next out (); null rs { .println

= out

pst.executeQuery ;; ()) {

.println getConnection

( " args (); conversion_ratio pst ();

( rs.getObject

= ) con.prepareStatement

=8010:EclResultLimit=100:QuerySet= ( (); foodMart " jdbc:hpcc:ServerAddress

(1)+

\ t \ http://hpccsystems.com/products Download JDBC Driver: " tcurrency_id " ::currency table \ ); t \

t"

+ rs.getString ( "select \ t \ tcurrency conversion_ratio,currency_id,currency

=99.99.99.99; (2)+ " \ t \ " t" ); thor:LazyLoad

+

rs.getString

(3)); = true:PageSize

- and - services/products/plugins/JDBC

=100:LogDebug=true:" from foodmart ::currency" , "" , "" );

);

- Driver 58

FC072911 Data Streaming

59

PS030112 Data Streaming: Apache Kafka • • • • following:support the Kafka Apache distributed isa publish orderingsemantics. consumptionovera clustermaintaining while of consumerper machines Explicit forsupport messages partitioningover Kafka serversdistributing and ofmessages thousands per second. High performanceevenmany with ofstored TB messages. Persistentdisk structuresO(1) with messaging that provideconstant time HorizontallyScalable

- throughput: evenvery throughput:with modest hardwareKafka of hundreds support can

-

subscribe messaging system.messaging subscribe to It isdesigned

-

partition 60

PS030112 Data Streaming: How we are using Apache Kafka

61

FC072911 R Integration

62

PS030112 rHPCC ECLProject ECLPROJECT(CODE:Example # record turn. each in on transform the # function records the all in through processes PROJECT function The # Creates# ECL"PROJECT" an definition. R Integration:

",",def,));") " outECLRecord

) is an R packageR isan providinginterfacean Systems.HPCC and betweenR

< -

setRefClass

methods = list(= methods =" ) } { function() = print }, addField }, getName

result < result << def < result

ECLRecord

-

- - paste(def, " ", id, " := ", value, ";")value,paste(def, ", := " id, ", " ("

= function(id, value) { value) function(id, =

= function() { function() = name paste (name, " := PROJECT( ",:= " (name, paste ECLProject ", def = "character"),= def ",

recordset,TRANSFORM

rHPCC

", contains=" ",

Plugin ECLDataset ( record,SELF inDataset$getName

recordset ", fields = list(name = "character",= list(name = fields ",

:= LEFT)); :=

performing (), ", TRANSFORM(", TRANSFORM(", ", (),

inDataset outECLRecord$getName

=" ECLDataset ", () , () 63

PS030112 ECL Code Snippet using EMBED(R) using SnippetECL Code R Integration: Code Example SET OF INTEGER2 1: EXAMPLE IMPORT R; result; // WorldHello result :=cat('Hello', 'World'); ENDEMBED; STRING cat(VARSTRING what, VARSTRING who) :=EMBED(R) 2: EXAMPLE arr arr ENDEMBED; paste( [3]; // Returns 5 sort( := sortArray what,who val

);

// R code// R here

([ - ) sortArray 111, 0, 113, 12, 45, 5]);12,0,113,45, 111,

// Rcode// here

( SET OF INTEGER2

val ) :=EMBED(R)

64

FC072911 Java Integration

65

PS030112 SystemsPlatform programming from another language in general. statusserver updates. goal secondary projectto tois Athis provide examples using HPCC the on handle generating ECL calling code,the Clusterthe SOAP via the interface, and retrieving results and Java JAVAtool building API ECLFlowwasHPCC necessary the developedalong the side tobase code Java Integration: To checkECL the code a with local compiler call: JavaECL API Examplenowith client

} } if( String boolean boolean if(! String isValid String ArrayList boolean ECLSoap // String if( }

getECLSoap es.getErrorCount es.getWarningCount this.isValid errorMessage wuid outStr eclCode

=

isWarning isError isValid es es.executeECL dsList

= =

= "";= getECLSoap () () is a class used that sets up the server connection and returns a es.getWuid ){

= “output(‘Hello World’);”;

= null; this.error = false;= = false;= isWarning isError

() 0){ > = false;=

= (

( () 0){ > es.syntaxCheck = true; eclCode (); ();

Java ECL API += "

= true;=

side compile check \

r \ ); nFailed

( eclCode

to execute code on the cluster, please verify your settings

)).trim();

ECLSoap

class submit tosubmit cluster Compile and

\ r \ n";

66

PS030112 • • • • • • • • • • • • • Java Integration: Java ECL API Merge Loop Limit Join Iterate Index Group FromField Distribute DeSpray Dedup Dataset Count ECLPrimitive

• • • • • • • ToField Table Spray Sort Rollup Project Output

• • • • • InternalLinking HygieneReport Specificities Concepts DataProfile Report SALT

Code Generators

• • • • • • • Naïve Bayes KMeanes Discretize Classify Classification Classify Build Associate ML LibraryML

67

PS030112 Generator Java Integration: Java ECL API Render ECLExample Resulting ECLCode String join.setRightRecordSet join.setLeftRecordSet join.setJoinType join.setJoinCondition join.setName Join myRecSet join joinResults

=

new Join(); := join( := ( “

myRecSet ( “INNER”);

MyLeftRecordSet,MyRightRecordSet,left.theID = join.ecl();

(" (

“ ( left.theID MyLeftRecordSet “ MyRightRecordSet ”);

” = "+" = right.theID ”);

”);

"); ECL Primitive Code

= right.theID , INNER);,

68

PS030112 Data Profile Example Java Integration: Java ECL API

SALT Code Generator

69

PS030112 Data Profile Example Result Set Java Integration: Java ECL API

https://github.com/hpcc Getsourcethe code: http://hpccsystems.com/demos/data Try demo:the –

SALT Code Generator

- systems/java - profiling - ecl - api

- demo

70

FC072911 Python Python Integration

71

PS030112 ECL Code Snippet using EMBED(Python) using SnippetECL Code Python Integration Python : Code Example result[1]; // Returns ‘green’ result := ENDEMBED; SET OF STRING 2: EXAMPLE arr arr ENDEMBED; INTEGEROF SET 1: EXAMPLE IMPORT Python; returnsorted returnsorted [3]; // Returns 5 := sortArray sortArrayOfString

sortArrayOfString

(

([ val

- ( sortArrayOfIntegers 111, 0,113, 12,111,5]);45, val ) )

) ) // Python // Python code here // Python code herecode // Python

(['

red','green','yellow

(SET OF STRING

(SET OF INTEGER OF (SET

val ) := ']);

EMBED(Python)

val ) :=

EMBED(Python)

72

What’s next? http://hpccsystems.com

. Version 5.0 Beta will be out soon.

. Ongoing R&D (5.2 and beyond):

. Level 3 BLAS support

. More Machine Learning related algorithms

. Heterogeneous/hybrid computing (FPGA, memory computing, GPU)

. Knowledge Engineering Language driving ML

. General usability enhancements (new abstractions, enhanced user interfaces)

. Additional capabilities for embedded languages

. More integration with other tools and systems WHT/082311

73 Useful links http://hpccsystems.com

. LexisNexis Open Source HPCC Systems Platform: http://hpccsystems.com

. Free Online Training: http://learn.lexisnexis.com/hpcc Upcoming Event!

. Machine Learning portal: http://hpccsystems.com/ml Join us during Big Data Atlanta Week for a Meetup! . The HPCC Systems blog: http://hpccsystems.com/blog EDA Toolkit for Data Scientists . Community Forums: http://hpccsystems.com/bb May 6, 2014 at 4:30pm LexisNexis Alpharetta office, . Our GitHub portal: https://github.com/hpcc-systems 1000 Alderman Drive

. JIRA: https://track.hpccsystems.com RSVP: http://www.meetup.com/Atla ntas-Big-Data-Week- 2014/events/169591702/

WHT/082311

74 Trivia! http://hpccsystems.com

Answer the question and win a prize!

• What is the name of the data refinery engine that provides batch oriented data manipulation? • What is the name of the data delivery engine that provides real-time analytics? • What is the query processing language supporting the HPCC Systems platform? • What can be used to generate ECL code for automating data profiling, parsing and cleansing? • Name three machine learning algorithms supported in ECL-ML. • What does KEL stand for?

WHT/082311

Distributed Machine Learning with the HPCC Systems Platform 75 Questions??? http://hpccsystems.com

WHT/082311

76 Appendix http://hpccsystems.com

WHT/082311

77 Terasort Benchmark results http://hpccsystems.com Execution Time Productivity Space/Cost (seconds)

140

120 Previous leader 100 80

60

40 20 0 HPCC HPCC Previous leader

WHT/082311

78 Medicaid Case Study http://hpccsystems.com

Scenario Proof of concept for Office of the Medicaid Inspector Generation (OMIG) of large Northeastern state. Social groups game the Medicaid system which results in fraud and improper payments.

Task Given a large list of names and addresses, identify social clusters of Medicaid recipients living in expensive houses, driving expensive houses.

Result Interesting recipients were identified using asset variables, revealing hundreds of high-end automobiles and properties.

Leveraging the Public Data Social Graph, large social groups of interesting recipients were identified along with links to provider networks.

The analysis identified key individuals not in the data supplied along with connections to suspicious volumes of “property flipping” potentially indicative WHT/082311 of mortgage fraud and money laundering

79

Network Traffic Analysis in Seconds http://hpccsystems.com

Scenario Conventional network sensor and monitoring solutions are constrained by inability to quickly ingest massive data volumes for analysis - 15 minutes of network traffic can generate 4 Terabytes of data, which can take 6 hours to process - 90 days of network traffic can add up to 300+ Terabytes

Task Drill into all the data to see if any US government systems have communicated with any suspect systems of foreign organizations in the last 6 months - In this scenario, we look specifically for traffic occurring at unusual hours of the day

Result Horizontal axis: time on a logarithmic scale In seconds, the HPCC sorted through months of network traffic Vertical axis: standard deviation (in hundredths) to identify patterns and suspicious behavior Bubble size: number of observed transmissions

WHT/082311

80 Wikipedia Graph Demo http://hpccsystems.com

Scenario Calculate Google Page Rank to be used to rank search results and drive visualizations.

Task Load the 75GIG English Wikipedia XML snapshot. Strip page links from all pages and run 20 iterations of Google Page Rank Algorithm. Generate indexes and Roxie query to support visualization.

Result Produces +- 300 million links between 15 million pages. Page Rank allows for ranking results in searching and driving more intuitive visualizations.

Interactive Visualization uses calculated page rank to define size and color of the nodes. Advanced Lays a foundation for advanced graph algorithms that combine ranking, Natural Language Processing and Machine Learning in scale. WHT/082311

81 Wikipedia Pageview Demo http://hpccsystems.com

Scenario 21 Billion Rows of Wikipedia Hourly Page view Statistics for a year.

Task Generate meaningful statistics to understand aggregated global interest in all Wikipedia pages across the 24 hours of the day built off all English Wikipedia page view logs for 12 months.

Result Produces page statistics that can be queried in 1 Steve_Jobs seconds to visualize which key times of day each 2 Whitney_Huston Wikipedia page is more actively viewed. 3 Wikipedia:SOPA_initiative/Learn_more 4 List_of_HTTP_status_codes%231xx_Informational Helps gain insight into both regional and time of day 5 Adele_(singer) key interest periods in certain topics. 6 Bruno_Mars This result can be leveraged with Machine Learning 7 Kim_Jong-il 8 Jeremy_Lin to cluster pages with similar 24hr Fingerprints. 9 Heavy_D 10 Christopher_Columbus 11 Murder_of_Meredith_Kercher 12 Eli_Manning 13 Jorge_Luis_Borges WHT/082311

82 Social Graph Analytics - Collusion http://hpccsystems.com

• LexisNexis Public Data Social Graph (PDSG) • Public Data relationships. • High Value relationships for Mapping trusted networks. • Large Scale Data Fabrication and Analytics. • Thousands of data sources to ingest, clean, aggregate and link. • 300 million people, 4 billion relationships, 700 million deeds. • 140 billion intermediate data points when running analysis. • HPCC Systems from LexisNexis Risk Solutions • Open Source Data Intensive high performance supercomputer. (http://hpccsystems.com) • Innovative Examples leveraging the LexisNexis PDSG • Healthcare. • Medicaid\Medicare Fraud. • Drug Seeking Behavior • Financial Services. • Mortgage Fraud. • Anti Money Laundering. • “Bust out” Fraud. • Potential Collusion (The value in detecting non arms length transactions)

WHT/082311

83 Connected Legal Content Universe http://hpccsystems.com

Docket Legislative Jury Legal Topics Number Instructions Branch Verdicts & Settlements Judicial / Executive & Court Branch Administrative Orders & Statutory Courts Decisions Code & Attorney Legislation Admitting Briefs, Authorities Pleadings & Motions Legal Matter Executive Jurisdiction Branch Government Official Area of Practice Dockets Newsworthy Person Judge Armed Service Branch Case Law Public Records Regulatory Attorney Bodies News Law Firm

Regulatory Register Education Exchanges & Exchange Commercial Treatises & Institutions Expert Codes Organization Analytical Witness Materials Analysis Subject Non-profit Industries & Organization Markets Industry Bar Associations Product

Scientific Expert Witness

User/Custo Geography mer Created Intellectual Company Content Property Information Expertise

WHT/082311

84