“Scalable Distributed Online Machine Learning Framework for Realtime Analysis of Big Data”

Hiroyuki MAKINO, NTT Software Innovation Center
XLDB2012, Sep. 12, 2012
© 2012 NTT Software Innovation Center

Objective of Jubatus
• To satisfy the “Scalable”, “Realtime”, and “Profound analysis” requirements for Big Data.
[Figure: Profound Analysis (for variety), Scalability (for volume), and a third axis for velocity; SVMlight is marked on the figure.]

Use Case: Social stream analysis
Realtime twitter analysis with Jubatus
• Social stream analysis for marketing research
• We want to learn reputation or mood from users' voices.
• Going over each tweet by hand is too hard.
• We need machine learning to classify tweets automatically in realtime.
[Figure: Twitter (more than 8000 tweets/sec) -> Jubatus classifies into 1600 companies -> analysis result.]
Demonstration: See you in the poster session

Performance
[Figure, "The number of servers and throughput": x-axis is the number of servers (0 to 18), y-axis is throughput [million features/sec.] (0 to 3.5). Annotations: 6700 tweets/sec. with 1 server; 100 thousand tweets/sec. with 16 servers.]
[Figure, "Accuracy and learning time": y-axis is accuracy [%]. Annotations: accuracy of batch learning; 90% accuracy in 1 sec. with 16 servers.]
* 200,000 features / approx. 30 features (per tweet) = 6,700 tweets/sec.
* SPAM classification task (Dataset: LIBSVM webspam)

Throughput and linear scalability
• Classification tests: classifying tweets into 1600 companies automatically
• Throughput: classifies all the Japanese tweets (6700 tweets/sec) with 1 server

Accuracy
• Achieves accuracy as good as batch-processing-based learning
• 90% accuracy in 1 sec. with 16 servers

KEY POINT 1: Jubatus Architecture
[Figure: user processes, proxy processes, server processes, and ZooKeeper processes, connected by RPC. Labels include Big Data, API, and MIX API (learning model synchronization).]
• User process
  • Implemented using the Jubatus client API (C++/Java/Python/PHP/Ruby clients)
  • Acquires input data and sends requests to servers via proxies
• Proxy process (JubaKeeper)
  • Relays clients’ requests to servers
• Server process (JubaServer)
  • Performs training, prediction, and learning model synchronization
  • Performance increases linearly with the number of servers
• ZooKeeper process
  • Manages distributed coordination, such as process liveness checks and leader election

KEY POINT 2: MIX technique
• Deterioration in the consistency of data synchronization is treated as a loss of training data, and it results in a decrease in accuracy.
• The MIX technique increases robustness by loosely exchanging intermediate results among servers.
[Figure: each server analyzes its own data and holds intermediate results r0, r1, r2, …, rn; after MIX, every server holds the same aggregated result R.]

MIX protocol:
1. Leader election
2. Aggregation
3. Calculating the aggregated result from the intermediate results: R = f (r0, r1, r2, …, rn)
4. Sharing the aggregated result

Components:
• MIX calculator: calculates the aggregated result based on intermediate results
• MIX protocol controller: regulates MIX calculation
• Membership manager: leader election, member addition or deletion

Membership management:
• Server failure: stop sharing the aggregated result
• Server addition: start sharing the aggregated result

Open Source Software
• Jubatus OSS website: http://jubat.us
• GitHub: https://github.com/jubatus/jubatus

Supported machine learning engines:
• Linear classification
  • Description: Classify input data into given categories
  • Algorithms: Perceptron, Passive Aggressive (PA), Confidence Weighted Learning (CW), AROW, Normal HERD (NHERD)
  • Use cases: Spam mail, Twitter classification
• Recommendation
  • Description: Recommend data similar to the input data
  • Algorithms: Inverted Index, LSH, MinHash
  • Use cases: Item recommendation, Advertisement
• Regression
  • Description: Estimate output value for input data
  • Algorithms: SVR using PA
  • Use cases: Power consumption estimation, Stock price estimation
• Statistics
  • Description: Calculate statistics data, such as sum, max, min, average, standard deviation, etc.
  • Use cases: Sensor monitoring, Data anomaly detection
• Graph mining
  • Description: Extract centrality and shortest paths of the given graph structure
  • Algorithms: Centrality computation (PageRank), Shortest path search
  • Use cases: Social community analysis, Network traffic analysis
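
The footnote on the Performance slide converts feature throughput into tweet throughput. The short Python sketch below reproduces that arithmetic and, as an extra step not on the slide, compares the reported 16-server figure with ideal linear scaling. The function and constant names are illustrative only; the numbers are the ones quoted on the slide.

```python
# Figures quoted on the Performance slide.
FEATURES_PER_SEC_ONE_SERVER = 200_000   # throughput with 1 server
FEATURES_PER_TWEET = 30                 # "approx. 30 features (per tweet)"
REPORTED_TWEETS_PER_SEC_16 = 100_000    # "100 thousand tweets/sec. with 16 servers"


def tweets_per_sec(features_per_sec, features_per_tweet=FEATURES_PER_TWEET):
    """Convert a feature throughput into an approximate tweet throughput."""
    return features_per_sec / features_per_tweet


def scaling_efficiency(reported, single_server, servers):
    """Ratio of the reported throughput to ideal linear scaling."""
    return reported / (single_server * servers)


one_server = tweets_per_sec(FEATURES_PER_SEC_ONE_SERVER)
efficiency = scaling_efficiency(REPORTED_TWEETS_PER_SEC_16, one_server, 16)

print(round(one_server))     # 6667, which the slide rounds to 6,700
print(round(efficiency, 4))  # 0.9375
```

Under these figures, 16 servers deliver roughly 94% of ideal linear scaling, which is consistent with the slide's claim of near-linear scalability.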
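
The four MIX protocol steps from KEY POINT 2 can be sketched in a single process as follows. This is a minimal illustration, not the Jubatus implementation: the slides leave the function f unspecified, so the sketch assumes f is a feature-wise average of the servers' weight vectors, and it replaces ZooKeeper-based leader election with a hypothetical "smallest server id wins" rule. All names (`mix`, `mix_round`, the server ids) are made up for the example.

```python
from typing import Dict, List

Weights = Dict[str, float]


def mix(intermediate: List[Weights]) -> Weights:
    """R = f(r0, ..., rn). Assumption: f is a feature-wise average."""
    if not intermediate:
        raise ValueError("no intermediate results to mix")
    total: Weights = {}
    for r in intermediate:
        for feature, value in r.items():
            total[feature] = total.get(feature, 0.0) + value
    n = len(intermediate)
    return {feature: value / n for feature, value in total.items()}


def mix_round(servers: Dict[str, Weights]) -> str:
    """Run one MIX round over the servers that are currently alive."""
    # 1. Leader election (hypothetical rule: smallest id wins).
    leader = min(servers)
    # 2. Aggregation: the leader gathers every intermediate result.
    gathered = [servers[name] for name in sorted(servers)]
    # 3. Calculating the aggregated result from the intermediate results.
    aggregated = mix(gathered)
    # 4. Sharing the aggregated result with every member.
    for name in servers:
        servers[name] = dict(aggregated)
    return leader


servers = {
    "s0": {"a": 1.0, "b": 2.0},
    "s1": {"a": 3.0},
    "s2": {"b": 4.0},
}
leader = mix_round(servers)
print(leader)         # s0
print(servers["s1"])  # every server now holds the same aggregated result
```

A failed server can be modeled by deleting its entry from `servers` before the next round; the remaining members still mix among themselves, which mirrors the "stop sharing aggregated result" behavior listed under membership management.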
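
Passive Aggressive (PA) is one of the online algorithms listed for the linear classification engine. The sketch below shows the basic binary PA update on sparse features, to illustrate why such learners suit realtime streams: each example is seen once and updates the model immediately. It is a textbook-style illustration, not Jubatus code; Jubatus handles multi-class problems such as the 1600-company task, while this example is binary, and the class name and feature names are hypothetical.

```python
from typing import Dict

Features = Dict[str, float]


class PassiveAggressive:
    """Binary Passive Aggressive classifier with labels +1 and -1."""

    def __init__(self) -> None:
        self.w: Dict[str, float] = {}

    def score(self, x: Features) -> float:
        # Sparse dot product between the weights and the feature vector.
        return sum(self.w.get(k, 0.0) * v for k, v in x.items())

    def predict(self, x: Features) -> int:
        return 1 if self.score(x) >= 0.0 else -1

    def train(self, x: Features, y: int) -> None:
        # Hinge loss: zero when the example is classified with margin >= 1.
        loss = max(0.0, 1.0 - y * self.score(x))
        sq_norm = sum(v * v for v in x.values())
        if loss == 0.0 or sq_norm == 0.0:
            return  # passive: the model already satisfies the margin
        tau = loss / sq_norm  # aggressive: smallest step that fixes the margin
        for k, v in x.items():
            self.w[k] = self.w.get(k, 0.0) + tau * y * v


clf = PassiveAggressive()
clf.train({"cheap": 1.0, "pills": 1.0}, +1)     # spam example
clf.train({"meeting": 1.0, "agenda": 1.0}, -1)  # non-spam example

print(clf.predict({"cheap": 1.0}))    # 1
print(clf.predict({"meeting": 1.0}))  # -1
```

Because each update touches only the features present in the current example, the per-example cost depends on the number of features per tweet rather than on the size of the model.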
