“Scalable Distributed Online Framework for Realtime Analysis of Big Data”

Hiroyuki MAKINO, NTT Software Innovation Center

XLDB2012 Sep. 12, 2012 © 2012 NTT Software Innovation Center

Objective of Jubatus

• To deliver “Scalable”, “Realtime”, and “Profound” analysis of Big Data.

[Figure: the three goals and the Big Data property each one addresses. Profound Analysis: for variety. Scalability: for volume. Realtime: for velocity. “SVMlight” also appears as a label in the figure.]

Use Case: Social stream analysis

Realtime Twitter analysis with Jubatus

• Social stream analysis for marketing research
• We want to know reputation or mood from users' voices.
• It is too hard to go over each tweet by hand.
• We need machine learning to classify tweets automatically in realtime.

[Figure: Twitter (more than 8000 tweets/sec) → Analysis (classifies into 1600 companies) → result]

Demonstration: See you in the poster session.


Performance

[Figure, panel 1: “The number of servers and throughput”. X axis: the number of servers (0 to 18). Y axis: throughput [million features/sec.] (0 to 3.5). Annotations: 6700 tweets/sec. with 1 server*; 100 thousand tweets/sec. with 16 servers.]

[Figure, panel 2: “Accuracy and learning time”*. Y axis: accuracy [%]. Annotations: accuracy of batch learning; 90% accuracy in 1 sec.]

* 200,000 features/sec. / approx. 30 features (per tweet) = approx. 6,700 tweets/sec.
* SPAM classification task (Dataset: LIBSVM webspam)

• Throughput and linear scalability
  - Classification tests: classifying tweets into 1600 companies automatically
  - Throughput: classifies all the Japanese tweets (6700 tweets/sec) with 1 server
• Accuracy
  - Achieves accuracy as good as batch-processing-based learning
  - 90% accuracy in 1 sec. with 16 servers
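The throughput footnote is plain arithmetic, and it can be checked in a few lines. The sketch below uses only the figures quoted on the slide; the function names are hypothetical, and the "ideal linear" projection is an assumption used for comparison, not a measured result.

```python
# Figures taken from the slide's footnote.
FEATURES_PER_SEC_ONE_SERVER = 200_000  # throughput quoted for 1 server
FEATURES_PER_TWEET = 30                # approximate, per the footnote


def tweets_per_sec(features_per_sec: float, features_per_tweet: float) -> float:
    """Convert feature throughput into tweet throughput."""
    return features_per_sec / features_per_tweet


def ideal_linear_throughput(per_server: float, servers: int) -> float:
    """Hypothetical helper: throughput if scaling were perfectly linear."""
    return per_server * servers


single = tweets_per_sec(FEATURES_PER_SEC_ONE_SERVER, FEATURES_PER_TWEET)
ideal_16 = ideal_linear_throughput(single, 16)

print(round(single))    # 6667, which the slide rounds to 6,700
print(round(ideal_16))  # 106667, in line with "100 thousand tweets/sec."
```

The ideal 16-server figure is close to the roughly 100 thousand tweets/sec. reported for 16 servers, which is what the slide means by linear scalability.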


KEY POINT 1: Jubatus Architecture

[Figure: architecture diagram. Big Data feeds the user processes; user processes, proxy processes, and server processes are connected via API/RPC; server processes run MIX (learning model synchronization); ZooKeeper processes provide coordination.]

• User process
  - Implemented using the Jubatus client API (C++/Java/Python/PHP/Ruby clients)
  - Acquires input data and sends requests to servers via proxies
• Proxy process (JubaKeeper)
  - Relays clients’ requests to servers
• Server process (JubaServer)
  - Performs training, prediction, and learning model synchronization
  - Performance increases linearly with the number of servers
• ZooKeeper process
  - Manages distributed coordination, such as process liveness checks and leader election
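The request path described above (user process → proxy → server) can be illustrated with a toy, in-process model. This is not the Jubatus client API: `ToyServer`, `ToyProxy`, and the round-robin routing rule are all hypothetical stand-ins chosen only to show how a proxy spreads requests so that adding servers adds capacity.

```python
from itertools import cycle
from typing import Dict, List


class ToyServer:
    """Stand-in for a server process: counts the labels it is trained on."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.label_counts: Dict[str, int] = {}

    def train(self, label: str, text: str) -> str:
        self.label_counts[label] = self.label_counts.get(label, 0) + 1
        return self.name  # report which server handled the request


class ToyProxy:
    """Stand-in for a proxy process: relays each request to one server."""

    def __init__(self, servers: List[ToyServer]) -> None:
        self._next_server = cycle(servers)  # round-robin is an assumption

    def train(self, label: str, text: str) -> str:
        return next(self._next_server).train(label, text)


servers = [ToyServer("s0"), ToyServer("s1")]
proxy = ToyProxy(servers)

# The "user process": acquires input data and sends requests via the proxy.
handled_by = [proxy.train("company_a", f"tweet {i}") for i in range(4)]
print(handled_by)  # ['s0', 's1', 's0', 's1']
```

Because each server sees only part of the stream, the servers' models drift apart; that is the gap the learning model synchronization step is meant to close.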


KEY POINT 2: MIX technique

• Deterioration in the consistency of data synchronization is treated as a loss of training data, which results in a decrease in accuracy.
• The MIX technique increases robustness by loosely exchanging intermediate results among servers.

[Figure: MIX technique. Each server analyzes its share of the data and holds intermediate results r0, r1, r2, …, rn. The aggregated result R = f (r0, r1, r2, …, rn) is calculated and shared, so that every server ends up holding R.]

MIX protocol steps:
1. Leader election
2. Aggregation of the intermediate results
3. Calculating the aggregated result from the intermediate results
4. Sharing the aggregated result

Components:
• MIX calculator [Calculation of the aggregated result based on intermediate results]
• MIX protocol controller [Regulating the aggregated result]
• Membership manager [Leader election, member addition or deletion]

Membership management:
• Server failure: stop sharing the aggregated result
• Server addition: start sharing the aggregated result
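The four MIX steps can be sketched as a single function over per-server weight dictionaries. This is a minimal illustration under stated assumptions: the slide only says R = f (r0, …, rn), so taking f to be a plain average is an assumption, the "smallest name wins" leader rule is a toy substitute for real leader election, and `mix` and `mix_round` are hypothetical names.

```python
from typing import Dict, List, Tuple

Weights = Dict[str, float]


def mix(intermediate: List[Weights]) -> Weights:
    """Aggregate R = f(r0, ..., rn); here f is a plain average (assumption)."""
    keys = set().union(*intermediate)
    n = len(intermediate)
    return {k: sum(r.get(k, 0.0) for r in intermediate) / n for k in sorted(keys)}


def mix_round(servers: Dict[str, Weights], alive: List[str]) -> Tuple[str, Weights]:
    """One MIX round over the servers currently in the membership list."""
    leader = min(alive)                           # 1. leader election (toy rule)
    gathered = [servers[name] for name in alive]  # 2. aggregation
    aggregated = mix(gathered)                    # 3. calculate R
    for name in alive:                            # 4. share R with members
        servers[name] = dict(aggregated)
    return leader, aggregated


servers = {
    "s0": {"good": 1.0, "bad": -1.0},
    "s1": {"good": 3.0},
    "s2": {"bad": -2.0},  # treated as failed: not in the membership list
}
leader, aggregated = mix_round(servers, alive=["s0", "s1"])
print(leader, aggregated)  # s0 {'bad': -0.5, 'good': 2.0}
```

The failed server `s2` is simply left out of the round and keeps its stale weights, which mirrors the "stop sharing the aggregated result" behavior: the round still completes, at the cost of ignoring whatever `s2` had learned.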


Open Source Software

• Jubatus OSS website
  - http://jubat.us
• GitHub
  - https://github.com/jubatus/jubatus

Supported machine learning engines

• Linear classification
  - Description: Classify input data into given categories
  - Algorithms: Passive Aggressive (PA), Confidence Weighted Learning (CW), AROW, Normal HERD (NHERD)
  - Use cases: Spam mail and Twitter classification
• Recommendation
  - Description: Recommend data similar to the input data
  - Algorithms: Inverted Index, LSH, MinHash
  - Use cases: Item recommendation, Advertisement
• Regression
  - Description: Estimate the output value for input data
  - Algorithms: SVR using PA
  - Use cases: Power consumption estimation, Stock price estimation
• Statistics
  - Description: Calculate statistics such as sum, max, min, average, and standard deviation
  - Use cases: Sensor monitoring, Data anomaly detection
• Graph mining
  - Description: Extract centrality and the shortest path of the given graph structure
  - Algorithms: Centrality computation (PageRank), Shortest path search
  - Use cases: Social community analysis, Network traffic analysis
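To make the online learning style of the classification engine concrete, here is a minimal sketch of the basic Passive Aggressive (PA) update for a binary label, as it is commonly described in the online learning literature. It is not taken from the Jubatus source, it handles two classes rather than 1600, and the feature dictionaries are made-up examples.

```python
from typing import Dict

Features = Dict[str, float]


def dot(w: Features, x: Features) -> float:
    """Sparse dot product over the features present in x."""
    return sum(w.get(k, 0.0) * v for k, v in x.items())


def pa_update(w: Features, x: Features, y: int) -> float:
    """One Passive Aggressive step for a binary label y in {-1, +1}.

    Updates w in place and returns the hinge loss suffered before the update.
    """
    loss = max(0.0, 1.0 - y * dot(w, x))
    norm_sq = sum(v * v for v in x.values())
    if loss > 0.0 and norm_sq > 0.0:
        tau = loss / norm_sq  # step size of the basic PA rule
        for k, v in x.items():
            w[k] = w.get(k, 0.0) + tau * y * v
    return loss


w: Features = {}
spam = {"free": 1.0, "money": 1.0}
ham = {"meeting": 1.0, "tomorrow": 1.0}

first_loss = pa_update(w, spam, +1)   # model is empty, so the loss is 1.0
pa_update(w, ham, -1)
second_loss = pa_update(w, spam, +1)  # margin is now met, so the loss is 0.0
print(first_loss, second_loss)        # 1.0 0.0
```

Each example is processed once and then discarded, which is what lets this family of algorithms keep up with a stream instead of waiting for a batch job.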