“Scalable Distributed Online Machine Learning Framework for Realtime Analysis of Big Data”
Hiroyuki MAKINO, NTT Software Innovation Center
XLDB2012 Sep. 12, 2012 © 2012 NTT Software Innovation Center
Objective of Jubatus
• To satisfy “Scalable”, “Realtime”, and “Profound analysis” for Big Data.
For variety Profound Analysis
SVMlight
Scalability For volume
For velocity © 2012 NTT Software Innovation Center
Use Case: Social stream analysis
Classifies into 1600 companies
Analysis Twitter result
More than 8000 tweets/sec Realtime twitter analysis with Jubatus • Social stream analysis for marketing research • We want to know reputation or mood from their voice. • Too hard to go over each tweet • We need machine learning to classify tweets automatically in Demonstration realtime. See you in the poster session
© 2012 NTT Software Innovation Center
Performance
The number of servers and throughput Accuracy and learning time [million] 3.5
3.0 Accuracy of batch learning
2.5 6700 tweets/sec. 2.0 with 1 server
1.5 90% accuracy 100 thousand in 1 sec.
tweets/sec. Accuracy[%] 1.0 with 16 servers The number of servers Throughput[features/sec.] 0.5
0 0 2 4 6 8 10 12 14 16 18 The number of servers * 200,000 features / approx. 30 features (per tweet) = 6,700 tweets/sec. *SPAM classification task (Dataset: LIBSVM webspam)
Throughput and linear scalability Accuracy Classification Tests: Classifying tweets into 1600 - Achieving the accuracy as good as batch companies automatically processing based learning. - Throughput: Classifies all the Japanese - 90% accuracy in 1 sec. with 16 servers tweets (6700 tweets/sec) with 1 server
© 2012 NTT Software Innovation Center
KEY POINT 1: Jubatus Architecture Server processes User processes Proxy processes MIX API (learning model synchronization)
Big Data API
API RPC RPC
ZooKeeper processes • User process • Implemented using Jubatus client API (C++/Java/Python/PHP/Ruby clients) • Acquires input data and sends requests to servers via proxies • Proxy process (JubaKeeper) • Relays clients’ requests to servers • Server process (JubaServer) • Performs the training and prediction processing, and learning model synchronization • Increases performance linearly with the number of servers • ZooKeeper process • Manages distributed coordination, such as process alive check and leader selection
© 2012 NTT Software Innovation Center
KEY POINT 2: MIX technique • Deterioration in the consistency of data synchronization is considered to be a loss of data to be trained, and it results in a decrease in accuracy. • MIX technique can increase robustness by exchanging the intermediate results among servers loosely. Data Membership management Server failures: Server addition: MIX technique Stop sharing Start sharing aggregated result aggregated result MIX calculator 1. Leader election [Calculation for aggregated result based on intermediate results] addition Analysis Analysis Analysis Analysis MIX Protocol 3. Calculating controller Intermediate Intermediate Intermediate aggregated Intermediate [Regulating aggregated Results r0 Results r1 Results r2 Results rn result from result] MIX intermediate calculation 2. Aggregation results Membership Aggregated Aggregated Aggregated Aggregated R = f (r0, r1, r2, …, rn) manager Result R Result R Result R Result R [Leader election, members addition or deletion] 4. Sharing the aggregated result
© 2012 NTT Software Innovation Center
Open Source Software
• Jubatus OSS website • http://jubat.us • Github https://github.com/jubatus/jubatus
Supported machine learning Description Algorithms Usecases engines Perceptron, Passive Aggressive (PA), Confidence Classify input data into Spam mail Weighted Learning (CW), Linear classification given categories Twitter classification AROW, Normal HERD (NHERD) Recommend similar data Item recommendation Inverted Index, LSH, MinHash Recommendation as input data Advertisement Power consumption Estimate output value for SVR using PA estimation Regression input data Stock price estimation Calculate statistics data, such as sum, max, min, Sensor monitoring Statistics average, standard Data anomaly detection deviation, etc
Extract a centrality and Centrality Social community Graph mining shortest path of the given computation(PageRank), analysis graph structure Shortest path search Network traffic analysis © 2012 NTT Software Innovation Center