NTT Technical Review, December 2012, Vol. 10, No. 12

Feature Articles: Platform Technologies for Open Source Cloud Big Data

Jubatus in Action: Report on Realtime Big Data Analysis by Jubatus

Keitaro Horikawa, Yuzuru Kitayama, Satoshi Oda, Hiroki Kumazaki, Jungyu Han, Hiroyuki Makino, Masakuni Ishii, Koji Aoya, Min Luo, and Shohei Uchikawa

Abstract

This article revisits Jubatus, a scalable distributed framework for profound realtime analysis of big data that improves availability and reduces communication overheads among servers by using parallel data processing and by loosely sharing intermediate results. After briefly reviewing Jubatus, it describes the technical challenges and goals that should be resolved and introduces our design concept, open source community activities, achievements to date, and future plans for expanding Jubatus.

1. Introduction

There is little doubt that data is of great value and importance in databases, data mining, and other data-centered applications. With the recent spread of networks, attention is being focused on the large quantities of a wide variety of data being generated and transmitted as big data. This trend is accelerating [1] as a result of advances in information and communications technology (ICT), which simplifies the collection and analysis of big data.

Even now, there is a strong need to make new discoveries and find previously unnoticed patterns by taking huge amounts of stockpiled data in a certain area, such as a dozen or so years of clinical data, and analyzing it from all angles. This is one scenario in which big data is put to good use in business. This trend is not limited to operations confined to a specific area; it includes the possibility of swiftly finding new business possibilities or synergistic effects by focusing on relationships in big data spanning different fields or specialist areas. We think that this trend will expand into cross-domain big data analysis in the future*1.

*1 Cross domain: Access to data spanning different domains. http://en.wikipedia.org/wiki/Cross-domain_solution

There are two major types of big data and techniques for analyzing it.

(1) Stockpile type: Lumped high-speed analysis of accumulated big data (batch processing)
(2) Stream type: Sequential high-speed analysis of a data stream being continuously generated without accumulation (realtime processing)

With case (2) in particular, the ambiguity inherent in the environment is creating a growing need to make judgments and decisions on the basis of insufficient information.

In this article, we revisit Jubatus, a framework intended for realtime stream-type big data analysis that provides profound analysis ability by online machine learning as added value. Jubatus was introduced in the June 2012 issue of NTT Technical Review [1]. That article focused on a key mechanism called mix. In this article, after briefly reviewing Jubatus, we describe the technical challenges and goals that should be resolved and introduce our design concept, open source community activities, achievements to date, and future plans for expanding Jubatus.

2. Jubatus

2.1 Overview

Jubatus is a scalable distributed computing framework for online machine learning. The name comes from the Latin word for the cheetah, an agile animal. It is pronounced “yu-ba-tus”. Developed jointly by Preferred Infrastructure Corporation and NTT Software Innovation Center, it is currently published on websites as a Japan-originated big data open source project [2]–[4].

The goal of Jubatus is to enable swift and profound analysis of stream-type big data. One example of its use is as a social media application for automatically classifying the huge number of tweets*2 (over 8000 per second) generated all over the world. This processing involves three requirements: large volume, high speed, and profound analysis. In other words, it supports natural language processing and automatic multicategory classification, at high speed without lag, of a data stream of 16 Mbit/s.

*2 Tweets: short messages sent using the popular social networking service and microblogging service Twitter.

However, these three requirements basically have a trade-off relationship and it is intrinsically difficult to satisfy all of them simultaneously. Jubatus satisfies both profound analysis and scalability. Here, profound analysis is the automatic categorization of unstructured information intended for human beings, such as natural language. It can also replace human labor for unclearly formulated processing work, such as recommendation, prediction, and relationship discovery. From the technical perspective, it involves challenges in the areas of machine learning, artificial intelligence, and pattern recognition.

On the other hand, scalability encompasses the issues of (1) increases in processing requests and (2) increases in data size. Issue (1) can be further divided into throughput (volume of requests to be handled per unit time) and response (response for each instance, without lag). In general, batch processing focuses on throughput while realtime processing focuses on response. Approaches to issue (2) are either processing the data without waiting or dividing and storing it.

Jubatus answers the need for making prompt judgments in situations exhibiting uncertainty and ambiguity. After our comprehensive investigation of the design concept, under which information for judgment is being collected continuously and decisions are being made without lag, we separate the profound analysis (functionality) from the scalability (non-functionality). More specifically, a profound analysis design likens the logic of online machine learning to an engine or central processing unit (CPU), which can be continuously upgraded as removable analysis modules. A scalable design is seen as a common infrastructure chassis or motherboard, which can be scaled by installing analysis modules into the common framework. The ultimate goal of Jubatus is to provide everyone with scalable machine learning. Our policy is to provide an online machine learning framework for big data that is easy to understand and easy to use, with hardware that scales out over inexpensive commodity servers to enable massively parallel distributed processing and software that is not restricted to a few data scientists, programmers, and specialists.

Fig. 1. Outline of the design concept. (Figure labels: machine learning engine + high-speed framework; stream-type big data; classification, regression, recommendation; analysis results; online execution of profound analysis.)

2.2 Applicable areas

Google’s PageRank is well known as a technique for calculating the importance of web pages from all over the world [5]. Since website link structures are updated comparatively slowly, batch analysis is suitable in this case. Social media such as Twitter tweets, on the other hand, are characterized by having finer granularity than the web, by having light content, by having small chunks of data that are transmitted in large numbers from a wider demographic than that of web users, and by the immediacy of information being vital. Through social media, we can analyze how a broad user population reacts to changes and events in the external environment and make effective practical use of realtime analysis of stream-type big data, with cases applied to business enterprises.

We have provided primary classifications of Twitter information as an application of Jubatus. Since the information is represented in natural language with a limited number of characters, expressions having distinctive omissions, abbreviations, coined words, jargon, and mannerisms were included in the analysis subject matter. With over 2000 tweets per second in the Japanese language alone, it is difficult for humans to extract and organize useful information instantly by sight. Therefore, general-purpose primary filters that provide major classifications and refining functions are useful. Using the approximately 1600 companies listed on the first section of the Tokyo Stock Exchange as the classification categories, Jubatus speedily classifies the 2000 tweets per second into their corresponding business categories and supplies information to an analysis application. This implementation uses an online machine learning technique called Jubatus classifiers (multi-valued classification). We used highly trustworthy published information, such as Wikipedia, as training data and automatically constructed a learning model for classification. Unlike item classification by keyword matching or related words, we use online learning to update the vector equations of the separating planes that divide an n-dimensional feature vector space*3 into a number of categories. We can also [...]

[...] permitted range tolerated [6]. June 2012 saw the release of version 0.3 of Jubatus and we are planning to augment the lineup of machine learning functions and strengthen the support framework. The types of machine learning that we currently support are outlined in Table 1.

Jubatus will be useful for applications that require speedy judgment. It is expected to be able to discover and analyze original relationships among large volumes of data from different domains.

2.4 Distributed processing architecture

The distributed processing architecture is shown in [...]
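The online update of separating planes described for the tweet classifier in Section 2.2 can be illustrated with a minimal sketch. The article does not say which learning algorithm the Jubatus classifier uses in this deployment, so the code below assumes a plain multiclass perceptron as a stand-in. The `featurize` helper, the class name, the two category labels, and the toy stream are all hypothetical, and nothing here is the Jubatus API. Each category keeps one weight vector, and the boundary between two categories is the plane on which their scores are equal. Every labeled example is processed once and then discarded, which is what makes the learning online.

```python
from collections import Counter


def featurize(text):
    """Hypothetical feature extractor: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


class OnlineMulticlassPerceptron:
    """Assumed stand-in learner: one weight vector per category."""

    def __init__(self):
        self.weights = {}  # category -> {feature: weight}

    def score(self, category, features):
        w = self.weights[category]
        return sum(w.get(f, 0.0) * v for f, v in features.items())

    def predict(self, features):
        if not self.weights:
            return None
        # sorted() makes tie-breaking deterministic.
        return max(sorted(self.weights), key=lambda c: self.score(c, features))

    def train(self, features, category):
        """Consume one labeled example; no training data is stored."""
        self.weights.setdefault(category, {})
        rivals = [c for c in sorted(self.weights) if c != category]
        if not rivals:
            return  # nothing to separate from yet
        rival = max(rivals, key=lambda c: self.score(c, features))
        # Mistake-driven update; a tie is treated as a mistake.
        if self.score(category, features) <= self.score(rival, features):
            true_w = self.weights[category]
            rival_w = self.weights[rival]
            for f, v in features.items():
                true_w[f] = true_w.get(f, 0.0) + v    # pull the true category closer
                rival_w[f] = rival_w.get(f, 0.0) - v  # push the rival category away


# Toy labeled stream (hypothetical data and category names).
stream = [
    ("bank raises interest rates", "finance"),
    ("new phone camera released", "electronics"),
    ("bank cuts interest rates", "finance"),
    ("phone camera review", "electronics"),
]

model = OnlineMulticlassPerceptron()
for text, category in stream:
    model.train(featurize(text), category)

print(model.predict(featurize("camera phone battery")))  # electronics
print(model.predict(featurize("bank interest news")))    # finance
```

Because an update touches only the weights of the features present in the current example, the cost per message is proportional to the message length rather than to the amount of data seen so far. That property is what makes this style of learner a fit for short, high-rate messages such as tweets.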
