<<

Invited talk at Wright State University Open SolvingBig Data Problemswith the SourceHPCC Systems Platform Presenter Sr

Architect : Dr.John Holt

hpccsystems.com

Real Life Graph Analytics http://hpccsystems.com

Scenario This view of carrier data shows seven known fraud claims and an additional linked claim.

The Insurance company data only finds a connection between two of the seven claims, and only identified one other claim as being weakly connected.

WHT/082311

2 Real Life Graph Analytics http://hpccsystems.com

Task

After adding the LexID to the carrier Data, LexisNexis HPCC Family 1 technology then explored 2 additional degrees of relative separation

Result

The results showed two family groups interconnected on all of these seven claims. Family 2

The links were much stronger than the carrier data previously supported.

WHT/082311

3 Property Transaction Risk http://hpccsystems.com

Three core transaction variables measured

• Velocity

• Profit (or not) Flipping • Buyer to Seller Relationship Distance (Potential of Collusion)

Collusion

Profit

WHT/082311

4 Overview Property Transaction Risk http://hpccsystems.com

Derived Public Data ±700 mill Deeds Relationships from +/- 50 terabyte database

Data Factory Clean

Collusion Graph Analytics

Chronological Analysis of all property Sales Large Scale Suspicious Cluster Ranking

Historical Property Sales Person / Network Level Indicators and Counts Indicators and Counts

WHT/082311

5 Suspicious Equity Stripping Cluster http://hpccsystems.com

WHT/082311

6 Results http://hpccsystems.com

Large scale measurement of influencers strategically placed to potentially direct suspicious transactions.

• All on one supercomputer measuring over a decade of property transfers nationwide.

• BIG DATA Products to turn other BIG DATA into compelling intelligence.

• Large Scale Graph Analytics allow for identifying known unknowns.

• Florida Proof of Concept – Highest ranked influencers . Identified known ringleaders in flipping and equity stripping schemes. . Typically not connected directly to suspicious transactions. – Known ringleaders not the Highest Ranking.

• Clusters with high levels of potential collusion. • Clusters offloading property, generating defaults. • Agile Framework able to keep step with emerging schemes in real estate.

WHT/082311

7 Social Graph and Prescriptions http://hpccsystems.com

Scenario Healthcare insurers need better analytics to identify drug seeking behavior and schemes that recruit members to use their membership fraudulently. Groups of people collude to source schedule drugs through multiple members to avoid being detected by rules based systems.

Providers recruit members to provide and escalate services that are not rendered.

Task Given a large set of prescriptions. Calculate normal social distributions of each brand and detect where there is an unusual socialization of prescriptions and services.

Result The analysis detected social groups that are sourcing Vicodin and other schedule drugs. Identifies prescribers and pharmacies involved to help the insurer focus investigations and intervene strategically to mitigate risk.

WHT/082311

8 The Data/Information flow http://hpccsystems.com

Cyber Security

Financial Services

Government

Health Care Big Insurance Data Online Reservations Retail

Telecommunications

Transportation & Logistics

Weblog Analysis

INDUSTRY SOLUTIONS

Customer Data Fusion Fraud Detection and Prevention Know Your Customer Master Data Management Weblog Analysis

. High Performance Computing Cluster Platform (HPCC) enables data integration on a scale not previously available and real-time answers to millions of users. Built for big data and proven for 10 years with enterprise customers. . Offers a single architecture, two data platforms (query and refinery) and a consistent data-intensive programming language (ECL) . ECL Parallel Programming Language optimized for business differentiating data intensive applications

WHT/082311

9 The Open Source HPCC Systems platform http://hpccsystems.com

• Open Source distributed data-intensive • Share-nothing architecture • Runs on commodity computing/storage nodes • Binary packages available for the most common distributions • Provides end-to-end Big Data workflow management , scheduler, integration tools, etc. • Originally designed circa 1999 (predates the original paper on MapReduce from Dec. ‘04) • Improved over a decade of real-world Big Data analytics • In use across critical production environments throughout LexisNexis for more than 10 years

WHT/082311

10 Components http://hpccsystems.com

• The HPCC Systems platform includes: • Thor: batch oriented data manipulation, linking and analytics engine • Roxie: real-time data delivery and analytics engine • A high level declarative dataflow language: ECL • Implicitly parallel • No side effects • Code/data encapsulation • Extensible • Highly optimized • Builds graphical execution plans • Compiles into C++ and native machine code • Common to Thor and Roxie • An extensive library of ECL modules, including data profiling, linking and Machine Learning

WHT/082311

11 The Three HPCC components http://hpccsystems.com

1 HPCC Data Refinery (Thor) • Massively Parallel data processing engine • Enables data integration on a scale not previously available • Programmable using ECL

2 HPCC Data Delivery Engine (Roxie) • A massively parallel, high throughput, query engine • Low latency, highly concurrent and highly available • Several advanced strategies for efficient retrieval • Programmable using ECL

3 Enterprise Control Language (ECL) • An easy to use, declarative data-centric programming language optimized for large-scale data management and query processing • Highly efficient; automatically distributes workload across all nodes; compiles to native machine code. • Automatic parallelization and synchronization

Conclusion: End to End platform • No need for any third party tools WHT/082311

12 The HPCC Systems platform http://hpccsystems.com

WHT/082311

13 Detailed HPCC Architecture http://hpccsystems.com

WHT/082311

14

Enterprise Control Language (ECL) http://hpccsystems.com

Declarative programming language: Describe what needs to be done and not how to do it Powerful: High level data activities like JOIN, TRANSFORM, PROJECT, SORT, DISTRIBUTE, MAP, etc. are available. Extensible: Modular and extensible, it can shape itself to adapt to the type of problem at hand Implicitly parallel: Parallelism is built into the underlying platform. The programmer needs not be concerned with data partitioning and parallelism Maintainable: High level programming language, without side effects and with efficient encapsulation; programs are more succinct, reliable and easier to troubleshoot Complete: ECL provides a complete data programming paradigm Homogeneous: One language to express data algorithms across the entire HPCC platform: data integration, analytics and high speed delivery

WHT/082311

15

Enterprise Control Language (ECL) http://hpccsystems.com

• The ECL optimizer generates an optimal execution plan • Sub-graphs run activities in parallel, automatically persist the results and spill to disk as needed • Lazy evaluation avoids re-computing results when data and code haven’t changed • Graphs are dynamic and display number of records traversing the edges and data skew percentages

WHT/082311

16 Machine Learning on HPCC http://hpccsystems.com

. Extensible Machine Learning Library developed in ECL

. Fully distributed across the cluster

. Supports PB-BLAS (Parallel Block BLAS)

. General statistical functions

. Supports supervised, semi-supervised and unsupervised learning methods

. Document manipulation, tokenization and statistical Natural Language Processing

. A consistent and standard interface to classification (“pluggable classifiers”)

. Efficient handling of iterative algorithms (for example, for numeric optimization and clustering)

. Open Source and available at: http://hpccsystems.com/ml WHT/082311

17 ECL-ML: extensible ML on HPCC http://hpccsystems.com

General aspects . Based on a distributed ECL linear algebra framework . New algorithms can be quickly developed and implemented . Common interface to classification (pluggable classifiers) ML algorithms . Linear regression . Several Classifiers . Multiple clustering methods . Association analysis Document manipulation and statistical grammar-free NLP . Tokenization . CoLocation Statistics . General statistical methods . Correlation . Cardinality . Ranking WHT/082311

18 ECL-ML: extensible ML on HPCC (ii) http://hpccsystems.com

Linear Algebra library . Support for sparse matrices . Standard underlying matrix/vector data structures . Basic operations (addition, products, transpositions) . Determinant/inversions . Factorization/SVD/Cholesky/Rank/UL . PCA . Eigenvectors/Eigenvalues . Interpolation (Lanczos) . Identity . Covariance . KD Trees

WHT/082311

19 Machine Learning on HPCC http://hpccsystems.com

. ML on a general-purpose Big Data platform means effective analytics in-situ

. The combination of Thor and Roxie is ideal when, for example, training a model on massive amounts of labeled historical records (Thor), and providing real-time classification for new unlabeled data (Roxie)

When applying Machine Learning methods to Big Data: data profiling, parsing, cleansing, normalization, standardization and feature extraction represent 85% of the problem!

WHT/082311

20 The Future: Knowledge Engineering http://hpccsystems.com

KEL: an abstraction for network/graph processing . Declarative model: describe what things are, rather than how to execute . High level: vertices and edges are first class citizens . A single model to describe graphs and queries . Leverages Thor for heavy lifting and Roxie for real-time analytics . Compiles into ECL (and ECL compiles into C++, which compiles into assembler)

ECL KEL Generates C++ (1->100) Generates ECL (1->12) Files and Records Entities and associations Detailed control of data format Loose control of input format; none of processing Can write graph and statistical algorithms Major algorithms built in Thor/Roxie split by human design Thor/Roxie split by system design Solid, reliable and mature &D WHT/082311

21 KEL by example http://hpccsystems.com

• Actor := ENTITY( FLAT(UID(ActorName),Actor=ActorName) ) • Movie := ENTITY( FLAT(UID(MovieName),Title=MovieName) ) • Appearance := ASSOCIATION( FLAT(Actor Who,Movie What) )

• USE IMDB.File_Actors(FLAT,Actor,Movie,Appearance)

• CoStar := ASSOCIATION( FLAT(Actor Who,Actor WhoElse) )

• GLOBAL: Appearance(#1,#2) Appearance(#3,#2) => CoStar(#1,#3)

• QUERY:FindActors(_Actor) <= Actor(_Actor) • QUERY:FindMovies(_Actor) <= Movie(UID IN Appearance(Who IN Actor(_Actor){UID}){What}) • QUERY:FindCostars(_Actor) <= Actor(UID IN CoStar(Who IN Actor(_Actor){UID}){WhoElse}) • QUERY:FindAll(_Actor) <= Actor(_Actor),Movie(UID IN Appearance(Who IN _1{UID}){What}),Actor(UID IN CoStar(Who IN _1{UID}){WhoElse})

WHT/082311

22

Assorted examples • • • • • • • • • • • • • • • • • Python integration Python integration R integration Streaming/Kafka Data SQL/JDBC integration View Smart Graph traversal problem handling (IMDB/KevinBacon example) SentimentAnalysis example Attribute manager Scoring/statistical models Full textclassifiers Search retrievaland Recommendation systems dataExploratory analysis Data Visualization SALTusing resolution Recordentity linkage, disambiguation, entity Data integrationFlow HPCC using

23

Data Integration Data

24

Data IntegrationFlow HPCC using GeneratesECL code so it can be alsoused as alearning tool Drag dropand interfaceto data workflows HPCCon

on Design Primitives shown thewith ECL workflow,Simple

25

Data IntegrationFlow HPCC using

calling a child job child calling a parentA job

26

Get the sourcecode:Get the https://github.com/hpcc Data IntegrationFlow HPCC using - systems/spoon - plugins

by a parent job Childcalled job

27

Record linkage, entitydisambiguation, entity resolution

28

Scalable Technology AutomatedScalable Linking (SALT) . . . . . Significantproductivity boost! clustering Sophisticatedspecificity based linking and standardization normalization cleansing, parsing, Providesfor automated QA/QC, dataprofiling, TemplateECL generator code based Linking Technology” The acronym stands for “Scalable Automated

Data Sources

and Record LinkageProcesses

Matching Weights and Threshold Computation Data PreparationProcesses Profiling Data Ingest Additional Searching Blocking Parsing ( ETL / ) Weight Assignment Cleansing Linking Iterations 482,410 Lines of of 482,410 Lines C++ Comparison and Record 3,980 Lines of of 3,980 Lines ECL 42 SALT of Lines

Normalization Record Match Decision Standardization Linked Data

File 29

1. 2. Thoroughly employing statistical algorithms, INPUT INPUT Rules based vs. probabilistic based Rules record linkage . . . . matches SALT credible all explore sufficiently matches SALT effectively false minimizes

J Flavio AtlantaSmith, John AtlantaSmith, John

avio

Villanustre, Atlanta

Villanustre, Atlanta

律商 联讯

®

Record 1 Record Record 2 Record

Error Error SALTtwo the has significantadvantages over rules: RULES

RULES

SALT SALT

“John for records city Smith“ the the and both match NO Match occurrence is small and there is only one present in in present one only is there and small is occurrence “Villanustre” is specific because the frequency of of frequency the because specific is “Villanustre” occurrence is largeare there is and occurrence many presentin Smith” is not is Smith” NO MATCH Match MATCH “ Flavio , because the system has learnt that “John , because the system learnt thathas , determinebecausethethat rules “ and “ and “ , because the rules determine that

specific because the frequency of the of frequency because specific Javio Atlanta Atlanta

” are” the same not

30

Technology Record Linkage: SALT Another purpose of SALT ofAnother purpose isto perform record linking

Scalable Automated Linking

31

Data Visualization

32

The Excel Visual Basic code to retrieve codetodata: ExcelBasic The Visual Access a Roxiea todisplay Accessdynamic data querya dashboardfrom Excel: within End Function Application.EnableEvents Call [Turnover]. [Turnover]. Application.EnableEvents Function

Data Visualization: Excel Integration refreshdata QueryTable.Connection QueryTable.Refresh ()

= True = = False=

= " URL;http ://10.242.99.99:8002/

WsEcl / xslt /query/ roxie

/ excelqueries

? asofdate =" + [ =" + qrydatecell].Text

33

Excel:in Roxiepublished Query the Selecting Data Visualization: Excel Integration…

34

Data Visualization: Reports can be saved locally or published to a a to published saved or locally can be Reports designer report visual Desktop

PentahoReport Designer BI Server DirectPort jdbc::ServerAddress URL: Connection Custom Connection Type: call roxie:: Roxie Query: ${ = designer Copy the HPCC JDBC Driver directoryin from from month_of_year select quarter, Adhoc a Roxiequery. by calling Reports/Charts can be created bywriting a custom query or org.hpccsystems.jdbcdriver.HPCCDriver CustomClass DriverName: true:PageSize pQtr foodmart

} SQL Query: SQL

\lib all_cancers_for_state =8008:EclResultLimit=100:QuerySet= jdbc and\jdbc specify the following values:

=100:TraceLevel=ALL:TraceToFile ::agg_c_10_sales_fact_1997quarterwhere = store_sales

Parameter

Generic Database

=99.999.999.9;Cluster= , unit_sales

(GA) (GA)

, store_cost,the_year

\report thor:LazyLoad =true: default:WsECL -

, 35

Data Visualization: Pentaho Report Download: http://reporting.pentaho.com/report_designer.phpDownload: Selection Parameter

Designer

36

response times to complex analytical queries. complex to times analytical response Usersexplore business data by drilling into and cross Real Data Visualization: - time graphical Analyticalonline Processing HPCC on (OLAP) Download: http://meteorite.bi/saiku Download: Mondrian using Saiku using Mondrian UI

- tabulating information speed with -

of - thought 37

Eclipse Data Visualization: • • Types:Data Source BIRT Web Service (Roxiequeries) (for JDBC - basedsource openreporting system HPCC with AdHoc queries or Roxiequeries) or queries

Types Data Source

Source Data JDBC

BIRT

as Data Source Web Service

38

Data Visualization: http://download.eclipse.org/birt/downloads/Download: BIRT ExampleOutput

39

• • • results, the such of as: This allows userto perform“Cell Formatting” the some “JavaScript”,ratherdefault the than plain text.example See below.snippet ECLcode ECLIn the Watch Tech Preview, userscan specific specify columnsoutput to contain “HTML” or • • • Visualizations to:canused be Data Visualization: Result Cell Formatting with lat.with and long. etc. Embed Hyperlinks: Cell Add textual formatting, e.g.: Summarize data. Demonstrate data/products Understand new data color :

Make cell Red/Yellow a that rowscertain are viewerthe important. to alert

Embed a link to a preformatted Google search, or a link to a Google map map tolink preformatteda a Google a tolink Embed orsearch, Google

d := dataset([{d := d3.Chart('cars',d3Chart := 'name','unused'); end; d; recordr := import Bundles.d3; import Text text) (bold

varstring dataset(

SampleData.Cars

Cars.CarsRecord

pc__javascript Cars.CarsDataset

;

; ) cars; )

, d3Chart.ParallelCoordinates}],, r);

javascript “__ trailingaor “__html” Name the column with

40

DataResult Visualization: Formatting Cell Output

41

Chart Data Visualization: Result Cell Formatting

Basic 42

Structures Data Visualization: Result Cell Formatting –

Tree 43

Data Visualization: Result Cell Formatting –

Graphs

44

Exploratory Data Analysis 45

The EDAThe tool tools set be willto added Flow HPCC that data. allowon biganalytics randomrecords existing in toused be analytictools SAS.and suchR as the ability to reduce a“Big Data” problem down to a statistically significant subset of EDA toolsprovidenot will existingallthe functionalityof toolsbenefitis additional soan ExploratoryData Analysis • • EDA Procedures • • • • • • variables Correlation statistics between two data of existing columns created bymore doing math or one on data add columns of to easily Options observation each for x/y as such opsmath Simple Random numbergenerators Tabulate procedure ascendingand descending Rank Max Percentiles,Deviation, Min, Standard Mode, Median, Mean, Frequency

(Essentially ProcUnivariate)

• • • • • • EDA Visualization and Histogramand Graph options will include Line, Bar, Pie plotted against X and the axisY GUI will allow userto define which data is Pie charts Histograms Plots Line Scatter Plots

The The 46

Sentiment Analysis 47

and and probabilities, and general inductive inference related problems. covering document learning, unsupervised textand supervised and analysis, statistics Matrix and (ML)Learning processing algorithms to assist intelligence;withbusiness SystemsECLHPCCits and 3. 2. 1. Sentiment Analysis:ECL as shown in in as shownthe example: statement import a using source Referencethe library in yourECL file to source ECLthe folder. IDE Extract the contents of the zip http://hpccsystems.com/ml the Library ML Download - ML libraryML includes extensible an set of fully parallel Machine

- ML IMPORT * FROM FROM * IMPORT FROM * IMPORT ML; FROM * IMPORT x3 := := x3 4}], 2, {3, 9}, 1, {3, DATASET([c := 7}, 2, {2, 5}, 1, {2, 5}, 2, {1, 1}, 1, {1, OUTPUT(x3); 4}], 2, {7, 9}, 1, {7, 4}, 2, {6, 1}, 1, {6, 3}, 2, {5, 9}, 1, {5, 0}, 2, {4, 0}, 1, {4, 1}, 2, {3, 8}, 1, {3, DATASET([ := x2 7}, 2, {2, 5}, 1, {2, 5}, 2, {1, 1}, 1, {1, ); NumericField

Kmeans

(x2,c); ML.Types ML.Cluster ); NumericField

;

;

48

ML.Classify sentimenta of that object.” of Forexample, othersome valueor attribute canI predict whether 3. 2. 1. steps:logical three Usingclassifier a ininvolves ML ECL classification a it give to orderdata new in to Classifying.classifierApply the wellhowclassifierthe fits Testing. Gettingmeasures of classifiedbeen externally trainingdata setof that has a from model the Learning - ML Classification: ML Naïve BayesClassification

tackles the problem:“given I know factsthese can object;an about Ipredict tweet

is is positive or negative? or positive .

.

OUTPUT(Results); Results := //Classifying TestModuleOUTPUT( TestModule Testing// Model := Model // the model Learning BayesModule */ ML.Docs.Trans dLexicon ML.Docs Use /* ML; IMPORT dTokens dRaw classification trainingset:

-

collection of and of negativecollection positive sentimenttweets

:= :=

:= :=

BayesModule.LearnD BayesModule.ClassifyD( ML.Docs.Tokenize.Split

ML.Docs.Tokenize.Lexicon := :=

:= := ( BayesModule.TestD ML.Docs.Tokenize.ToO

module to pre to module ML.Classify.NaiveBayes

);

IndependentDS - ( process tweet text and generate IndependentDS ( IndependentDS ( ML.Docs.Tokenize.Clean IndependentDS ( ( dTokens,dLexicon)). dTokens ;

, ClassDS ); ,

ClassDS , Model

, ClassDS ); WordsCounted; );

(

dRaw );

));

49

Sentilyze negative sentiment. tweets via the Twitter the via tweets realin API Enter a and search receive query real Classification Sentiment Analysis:“

uses HPCC Systemsuses HPCC itsML and

- time. - time time

Sentilyze - Library to classify tweets with positive or or positive with tweets to classify Library ” Twitter” Sentiment IMPORT IMPORT Keyword both Naïve and using Count classified Example code of dataseta of tweets thatare sentiment classifiers: processTweets rawTweets DATASET('~SENTILYZE:: Tweets := OUTPUT( nbSentiment kcSentiment OUTPUT(

https://github.com/hpcc- sourceGet the code: http://hpccsystems.com/demos/twitter Try demo:the Sentilyze kcSentiment,NAMED nbSentiment,NAMED

:= :=

:= := := := Sentilyze.PreProcess.ConvertToRaw

:= := Sentilyze.KeywordCount.Classify Sentilyze.NaiveBayes.Classify ; Sentilyze.PreProcess.RemoveAnalysis

TWEETS',Sentilyze.Types.TweetType.CSV

(' (' TwitterSentiment_KeywordCount TwitterSentiment_NaiveBayes systems/ecl ( processTweets - ( processTweets ml (Tweets); -

sentiment ( rawTweets

Bayes );

')); ); ); 50

'));

)

Graph TraversalProblemHandling

51

http://hpccsystems.com/download/docs/six examplethe Get code: problems forcustomerfinding insight detectionand of important data patterns trends. and Data analysis techniques HPCC using Systems ECL can applied be to social graph traversal GraphTraversal ProblemHandling

- degrees

aka “ SeparationLinksDegrees of using and Getting Useful Information fromData Kevinexample Bacon

52

Smart View Smart 53

requiring fusion forand missiontime organizations resolution challenges. compellinga option fordatathe scientist businessor analyst facedwith complex entity interface,leveragingalongwith its patented, a of statistically Smart View™Smart(SV)generates comprehensive profiles‘entities’ of – Smart View™Entity ResolutionVisualization & Linking Iteration Records Individual Source –

from a ‘perfect storm’ ofbig data (

1

- critical applications).combination The SV’sof intuitive user 2

3

ie , disparate,, complexand massive datasets 4

- based clustering technique, make it formulatingan in

across 4 iterations of the identity being analyzed. analyzed. being the identity of people, places, and recordstogetherwere fused depicting how 12 individual how 12 individual depicting This Smart View screenshot Smart This containsa visualization

- depth profile depth profile –

ultimately

54

relationship visualizationsdisplaythat‘click’ and dynamicallyaccording ‘point’ to commands. scientistinterpreting in analysis whattheir link results signify, ofset aViewSmartincludes statisticalprocess (similarSV’sto how entityresolution problem solved). is To dataaide the user to define a series ofrelationships, then uncover such relationships through an iterative, linkagesuncovered. betweenbe will them View’sSmart analysislink feature allowsend- the The morecomprehensive your entity profilesare, the greater the probability that non- Smart View™Link Analysis Visualization& linkages? sample a justare These degrees separation? of Forthose How manyHow individualsare linked address (for Kevin Bacon, or any or Kevin(for Bacon, address separation, how ‘strong’ are are the ‘strong’ how separation, other entitythat for matterother of the many questions SV’s questions the many of link analysis featureusershelp can actors with degreeone of to Kevin Bacon through 3 through to Kevin Bacon

obvious obvious 

) 55

SQL/ JDBC IntegrationJDBC SQL/

56

yourdata thetowrite needwithout single a line of ECL! This allows youto connectto HPCC Systems the platform through yourfavorite client JDBC and access queries. DriverJDBC HPCCThe providesSQL SQL/JDBC Integration • • • • • write ECL! and Leveragelearnto need dataclient functionality without HPCC JDBC and Creates entry Harnessesthe full power of HPCC under the covers Supports simple SQL SELECT or CALL syntax Analyze HPCC data using manyClient Interfaces JDBCUser

• • • • • Access published queries as DB Stored Procedures Opens the doorfor non ECLprogrammers to access HPCC data. Automatic Index fetching capabilities forquicker data fetches compiled, executed and your on target cluster Submitted SQL request generates ECL code which is submitted, Access HPCC data files as DB Tablesdata HPCC asDB files Access

- point for programmaticdatafor access point - based accessbased toSystems HPCC data published and files

57

Cluster=default:WsECLDirectPort=8010:EclResultLimit=100:QuerySet= thor:LazyLoad=

public Class.forName

SQL/JDBC Integration: Example Code

public

} {

class

} catch try Connection con= Connection

static con=

{

HpccConnect

(Exception e){ //1. Initialize HPCC Connection//1. (

" }

con.close }

System.

ResultSet pst.clearParameters

DriverManager. } //5. Close the connection while //4. Loop through all the records and write the information to the console console the to information the and write records the all through Loop //4. //3. Print the headers to the console the to headers the Print //3. //2. Write theWrite Queryselect//2. to PreparedStatement org.hpccsystems.jdbcdriver.HPCCDriver

void e.printStackTrace System.

( main(String[] main(String[] rs.next out (); null;; rs

{ .println = out

pst.executeQuery ()) {

.println ( getConnection

( " args (); conversion_ratio pst (); ( rs.getObject(1)+ = ) con.prepareStatement

(); foodMart " jdbc:hpcc:ServerAddress

\ t \ http://hpccsystems.com/products Driver: JDBCDownload " tcurrency_id " ::currency table table ::currency \ ); t \ t"

+ rs.getString ( "select \ t \ tcurrency conversion_ratio,currency_id,currency

=99.99.99.99; (2)+ " \ t \ " t" ); +

rs.getString

(3)); true:PageSize

- - and services/products/plugins/JDBC

=100:LogDebug=true:" from from foodmart ::currency" , "" , "" ); ); - Driver 58

Data Streaming

59

Data Streaming: ApacheKafka • • • • Apache Kafka is a distributedpublish support the following:support the consumption over a clusterof consumer machines whilemaintaining per Explicitsupport for partitioningmessages over Kafka serversdistributing and High performanceeven withmany ofstored TB messages. Persistent diskO(1) messaging structureswith that provideconstant time thousands of messages per second. ordering semantics. Horizontally Scalable

- throughput:even with very modest hardware Kafka can support hundreds of

-

subscribe messaging system. It to is designed -

partition partition 60

Data Streaming: Howwe are usingApache Kafka 61

R IntegrationR

62

rHPCC ) ) } ECLProject # ExampleECL CODE: PROJECT( transform# the recordeach on function turn.in result <- { print= function() def << - addField # The PROJECT function processes through all records in the # Creates ECL an "PROJECT" definition. }, }, result <- getName = list( methods R Integration:

",",def," ));") outECLRecord

is an R package providing an interface between R HPCC and Systems.

< -

setRefClass("

="

ECLRecord

paste(def," ", id, " := ", value, ";")

= function(id, value){ = function(id, = function() { = function() paste(name, " := PROJECT( ", name ECLProject ", def = "character"), recordset,TRANSFORM

rHPCC

", contains="",

Plugin ECLDataset ( record,SELF inDataset$getName recordset ", fields = list(name = "character",= list(namefields ",

:= LEFT));

performing (), ", TRANSFORM(",(), outECLRecord$getName ,()

inDataset =" ECLDataset ", ", 63

ECL CodeSnippet EMBED(R) using R Integration:R Example Code result; // Hello World cat('Hello',result:= 'World'); ENDEMBED; paste( STRING cat(VARSTRING what, VARSTRINGEMBED(R) who) := 2:EXAMPLE arr SET OF INTEGER2 1:EXAMPLE IMPORT R; arr ENDEMBED; val sort( [3]; // Returns Returns 5 // [3];

:= := sortArray what,who

); ); // R code here

([ - ) sortArray 111, 0, 113, 12, 45, 5]); 45, 12, 113, 0, 111, // R code R here//

( SET OF INTEGER2

val ) := EMBED(R) ) :=

64

Java Integration

65

Systems Platform programmingfromanother language in general. statusserver updates. secondarygoal A to project this tois provide examples using the on HPCC handle generating the ECL code, calling the Cluster via the SOAP interface, and retrievingresults and Java ECLAPI was developedtool HPCC Flow building along side the JAVA the necessary codebase to Java Integration:

client no with Example JavaECL API call: To compiler local a ECL codewith the check

boolean ECLSoap String //getECLSoap boolean String ArrayList if(! isValid String boolean String

} if( } } if(

es.getErrorCount es.getWarningCount this.isValid wuid eclCode outStr errorMessage

=

isValid isError isWarning es es.executeECL

dsList

= =

""; = getECLSoap () is a class used that sets up the server connection and returns a and returns connection server the up sets that used class a () is (); es.getWuid ){

World’);”; “output(‘Hello =

null; = this.error

false; = false; = isWarning isError

() > 0){ () > false; =

= (

( () > 0){ () > es.syntaxCheck true; = eclCode ();

Java ECLAPI += "

true; =

side compile check compile side \ r \ ); nFailed

( eclCode to execute code on the cluster, please verify your settings your verify please cluster, the on code execute to

)).trim();

ECLSoap

class submit to cluster submit and Compile

\ r \ n";

66

• • • • • • • • • • • • • Java Integration: Java ECLAPI – Merge Loop Limit Join Iterate Index Group FromField Distribute DeSpray Dedup Dataset Count ECL Primitive

• • • • • • • ToField Table Spray Sort Rollup Project Output

• • • • • Internal Linking Hygiene Report Specificities Concepts Data ReportProfile SALT

Code GeneratorsCode

• • • • • • • Naïve Bayes KMeanes Discretize Classify Classification Classify Build Associate ML Library ML

67

Generator Java Integration: Java ECLAPI – Render ECLExample Resulting ECL Code ECL Resulting String String join.setRightRecordSet join.setLeftRecordSet join.setJoinType join.setJoinCondition join.setName Join

myRecSet join joinResults

=

new Join(); := join( := ( “

myRecSet ( “INNER”);

MyLeftRecordSet,MyRightRecordSet,left.theID = join.ecl(); =

(" (

“ ( left.theID” MyLeftRecordSet “ MyRightRecordSet ”);

= "+"right.theID ”);

”);

"); ECLPrimitive Code

= right.theID , INNER);

68

Data Profile Example Java Integration: Java ECLAPI –

SALTGenerator Code

69

Data Profile Example Result Set Java Integration: Java ECLAPI –

https://github.com/hpcc- source code:the Get http://hpccsystems.com/demos/data Trydemo: the

SALTGenerator Code

systems/java - profiling - ecl - api

- demo

70

Python IntegrationPython

71

ECL CodeSnippet EMBED(Python) using Python Integration Python Example Code : 1:EXAMPLE 2:EXAMPLE arr SET OF INTEGER IMPORT Python; result[1]; // Returns ‘green’ := result ENDEMBED; return sorted OFSET STRING arr ENDEMBED; return sorted [3]; // Returns Returns 5 // [3];

:= := sortArray sortArrayOfString

sortArrayOfString

(

([ val

- ( sortArrayOfIntegers 111, 0, 113, 12, 45, 5]); 45, 12, 113, 0, 111, val // Python codehere Python // ) ) ) // Python code here code Python //

(['

red','green','yellow

(SET OF STRING STRING OF (SET

(SET OF INTEGER(SET OF

val

) := ']); EMBED(Python)

val ) := EMBED(Python)

72 What’s next? http://hpccsystems.com

. Version 5.2 should be out EOY.

. Ongoing R&D (5.2 and beyond):

. Level 3 BLAS support

. More Machine Learning related algorithms

. Heterogeneous/hybrid computing (FPGA, memory computing, GPU)

. Knowledge Engineering Language driving ML

. General usability enhancements (GUI to SALT data profiling and linking, etc.)

. Embedding other languages (Python, JavaScript, Java and R, for starters)

. Integration with other tools and systems (Pentaho Mondrian, R, Kafka, etc.) WHT/082311

73

Trivia Answer the question and winprize! a • • • • • • parsing cleansing?and What toused can generatebe ECLcodefor automating data profiling, Name Name three examples where HPCCcan integratedbe and applied. Whatstand KEL does for? platform? What processingquery is the HPCC languagetheSystems supporting analytics? Whatthe data of name is the that delivery engine providesreal data manipulation? Whatthe data of name is the refinerythat engine providesbatch oriented

- time time 74

...... Useful Links Email Phone Contact information: Our GitHub portal: Systems blog: HPCC The Online Training: MachineLearning portal: Systems Platform: HPCC Source Open LexisNexis Community Forums: :

: 678 6943320 [email protected]

http https:// ://learn.lexisnexis.com/hpcc http://hpccsystems.com/bb http:// http://hpccsystems.com/ml github.com/hpcc hpccsystems.com/blog - systems http://hpccsystems.com

75 Questions??? http://hpccsystems.com

WHT/082311

76 Appendix http://hpccsystems.com

WHT/082311

77 Terasort Benchmark results http://hpccsystems.com Execution Time Productivity Space/Cost (seconds)

140

120 Previous leader 100 80 60

40 20 0 HPCC HPCC Previous leader

WHT/082311

78 Medicaid Case Study http://hpccsystems.com

Scenario Proof of concept for Office of the Medicaid Inspector Generation (OMIG) of large Northeastern state. Social groups game the Medicaid system which results in fraud and improper payments.

Task Given a large list of names and addresses, identify social clusters of Medicaid recipients living in expensive houses, driving expensive houses.

Result Interesting recipients were identified using asset variables, revealing hundreds of high-end automobiles and properties.

Leveraging the Public Data Social Graph, large social groups of interesting recipients were identified along with links to provider networks.

The analysis identified key individuals not in the data supplied along with connections to suspicious volumes of “property flipping” potentially indicative WHT/082311 of mortgage fraud and money laundering

79

Network Traffic Analysis in Seconds http://hpccsystems.com

Scenario Conventional network sensor and monitoring solutions are constrained by inability to quickly ingest massive data volumes for analysis - 15 minutes of network traffic can generate 4 Terabytes of data, which can take 6 hours to process - 90 days of network traffic can add up to 300+ Terabytes

Task Drill into all the data to see if any US government systems have communicated with any suspect systems of foreign organizations in the last 6 months - In this scenario, we look specifically for traffic occurring at unusual hours of the day

Result Horizontal axis: time on a logarithmic scale In seconds, the HPCC sorted through months of network traffic Vertical axis: standard deviation (in hundredths) to identify patterns and suspicious behavior Bubble size: number of observed transmissions

WHT/082311

80 Wikipedia Graph Demo http://hpccsystems.com

Scenario Calculate Google Page Rank to be used to rank search results and drive visualizations.

Task Load the 75GIG English Wikipedia XML snapshot. Strip page links from all pages and run 20 iterations of Google Page Rank Algorithm. Generate indexes and Roxie query to support visualization.

Result Produces +- 300 million links between 15 million pages. Page Rank allows for ranking results in searching and driving more intuitive visualizations.

Interactive Visualization uses calculated page rank to define size and color of the nodes. Advanced Lays a foundation for advanced graph algorithms that combine ranking, Natural Language Processing and Machine Learning in scale. WHT/082311

81 Wikipedia Pageview Demo http://hpccsystems.com

Scenario 21 Billion Rows of Wikipedia Hourly Page view Statistics for a year.

Task Generate meaningful statistics to understand aggregated global interest in all Wikipedia pages across the 24 hours of the day built off all English Wikipedia page view logs for 12 months.

Result Produces page statistics that can be queried in 1 Steve_Jobs seconds to visualize which key times of day each 2 Whitney_Huston Wikipedia page is more actively viewed. 3 Wikipedia:SOPA_initiative/Learn_more 4 List_of_HTTP_status_codes%231xx_Informational Helps gain insight into both regional and time of day 5 Adele_(singer) key interest periods in certain topics. 6 Bruno_Mars 7 Kim_Jong-il This result can be leveraged with Machine Learning 8 Jeremy_Lin to cluster pages with similar 24hr Fingerprints. 9 Heavy_D 10 Christopher_Columbus 11 Murder_of_Meredith_Kercher 12 Eli_Manning 13 Jorge_Luis_Borges WHT/082311

82 Social Graph Analytics - Collusion http://hpccsystems.com

• LexisNexis Public Data Social Graph (PDSG) • Public Data relationships. • High Value relationships for Mapping trusted networks. • Large Scale Data Fabrication and Analytics. • Thousands of data sources to ingest, clean, aggregate and link. • 300 million people, 4 billion relationships, 700 million deeds. • 140 billion intermediate data points when running analysis. • HPCC Systems from LexisNexis Risk Solutions • Open Source Data Intensive high performance supercomputer. (http://hpccsystems.com) • Innovative Examples leveraging the LexisNexis PDSG • Healthcare. • Medicaid\Medicare Fraud. • Drug Seeking Behavior • Financial Services. • Mortgage Fraud. • Anti Money Laundering. • “Bust out” Fraud. • Potential Collusion (The value in detecting non arms length transactions)

WHT/082311

83 Connected Legal Content Universe http://hpccsystems.com

Docket Legislative Jury Legal Topics Number Instructions Branch Verdicts & Settlements Judicial / Executive & Court Branch Administrative Orders & Statutory Courts Decisions Code & Attorney Legislation Admitting Briefs, Authorities Pleadings & Motions Legal Matter Executive Jurisdiction Branch Government Official Area of Practice Dockets Newsworthy Person Judge Armed Service Branch Case Law Public Records Regulatory Attorney Bodies News Law Firm

Regulatory Register Education Exchanges & Exchange Commercial Treatises & Institutions Expert Codes Organization Analytical Witness Materials Analysis

Subject Non-profit Industries & Organization Markets Industry Bar Associations Product

Scientific Expert Witness

User/Custo Geography mer Created Intellectual Company Content Property Information Expertise

WHT/082311

84