Overview of Machine Learning Tools and Libraries

Daniel Pop, Gabriel Iuhasz
Institute e-Austria Timișoara
Bd. Vasile Pârvan No. 4, 300223 Timișoara, România
E-mail: danielpop, [email protected]

Abstract

Over the last three decades, many general-purpose machine learning frameworks and libraries have emerged from both academia and industry. The aim of this overview is to survey the market of ML tools and libraries and to compare them in terms of features and supported algorithms. As a large number of solutions is available, offering a wide spectrum of features, we first introduce a set of criteria, grouped in four categories, for both pruning and comparing the candidates. Based on these criteria, we present the results concisely in tables and briefly discuss the findings in each category.

1 Introduction

Given the enormous growth of collected and available data in companies, industry and science, techniques for analyzing such data are becoming ever more important. Research in machine learning (ML) combines classical questions of computer science (efficient algorithms, software systems, databases) with elements from artificial intelligence and statistics, up to user-oriented issues (visualization, interactive mining, user assistance and smart recommendations). Over the last three decades, many general-purpose machine learning frameworks, as well as special-purpose machine learning libraries, such as those for phishing detection [1] or speech processing [2], have emerged from both academia and industry. In this survey, we consider only the general-purpose frameworks.

The objectives of this work are driven by the scope and objectives of an ongoing initiative: the design of a distributed, open source system for scientific problem solving. In this context, we are particularly interested in aspects such as usability, the ability to handle large data sets from various sources, interoperability with other libraries, and distributed computing support. A second motivation behind this work is that no recent, similar surveys are available; the most recent one found by the authors is a data mining survey more than 4 years old [5].

Some words on methodology are necessary, since the ML domain, rebranded about a decade ago as Data Mining, Knowledge Discovery or the like, has produced a great many projects, libraries, tools and frameworks. We do not aim to review every ML framework ever created, but rather to cover the most used and active ones. For example, the Machine Learning Open Source Software repository (mloss.org) lists over 400 entries at the date of this paper (summer 2011). Some of these entries are Weka / R packages and add-ons, or refer to specific problem domains (biology, mathematics etc.). Even ignoring these entries, we are left with hundreds of packages, and these are only the open source ones. Therefore, we need a systematic approach to make a narrower selection.

In the first phase, we needed to identify which domain-specific repositories, public dissemination channels, previous surveys and similar works were available as a starting point. For example, searching for surveys about machine learning on popular Web search engines (Google, Bing, Yahoo! Search) returned no valuable results. Refining the search to 'data mining survey' returned a few useful results [3, 4, 5], the most recent one being over 4 years old (details about these related papers are given in Section 2). Other sources of candidates for the initial list were the results of polls and surveys conducted by popular, independent online bodies, such as Rexer Analytics [6] or KD Nuggets [8]. At the end of phase one, the initial list included more than 80 candidates (standalone tools, plug-ins and libraries) originating from different domains and providers, such as relational database system providers (Oracle Data Miner, Microsoft SQL Server, IBM Intelligent Miner, IBM SPSS Modeler), mathematics and statistics software (MATLAB, Mathematica, MathSoft S-Plus, Statistica, R), data mining software providers (RuleQuest C5.0/See5/Cubist, Salford Systems CART/SPM) or academia (KNIME, Weka, Orange).

In the second phase, we pruned the initial list by removing outdated candidates. Out of 63 distinct products covered by the 3 previous surveys [3, 4, 5], 44 (70%) were flagged as outdated [footnote 1], out of which 27 (61%) came from industry and the rest from research organisations. The final list was completed by adding software employing best-in-class neural network implementations, because of the wide applicability of these methods in modern systems. In the end, we selected a final list of 30 libraries and tools for review, out of over 100 candidates considered.

Footnote 1. In this context, we considered a product as outdated if no new versions of the product were released after 2010, or if we could not find it on the web at all.

The paper is organised as follows. The next section briefly reviews the latest available similar surveys and presents the findings of the most recent online polls conducted by renowned independent organisations. Section 3 details the criteria used in this survey for evaluating and comparing tools and libraries, as well as the rationale behind their selection. Section 4 discusses the main findings of this survey, while the last section presents our conclusions and future work.

2 Previous work

In the paper [3], from DataMining Lab and presented at the 4th International Conference on KDD (KDD98), the authors present a comparison of 17 leading DM tools of that time. Most of those tools (13, i.e. 77%) have disappeared from the market (e.g. Unica Technologies, DataMindCorp), were acquired by other companies and then abandoned (e.g. Thinking Machine and Integral Solutions, acquired by IBM), or simply lost their users' interest, so that no recent versions were issued (e.g. WizWhy and WizRule from WizSoft). One year later, in 1999, Goebel et al. [4] pulled together a very interesting survey of DM tools presenting 43 products, but we observe exactly the same outdating ratio as in the previous study (77%), leaving only 10 survivors. In their paper, the authors also identified similar online survey projects, which unfortunately are no longer maintained, except for KD Nuggets [8]. A more recent survey [5] (2007) focuses only on open-source software systems for data mining. Being a more recent study, its outdating ratio is better (only 50%), and thus 6 out of 12 projects are still alive.

Since 2007, Rexer Analytics [6] has been conducting yearly online surveys on Data Mining tools usage. The latest one available [7] (2011) shows that the R system continues to dominate the market (47%), while StatSoft Statistica, which has been climbing in the rankings, is selected as the primary data mining tool by the most data miners. Data miners report using an average of 4 software tools overall. Statistica, Knime, RapidMiner and Salford Systems received the strongest satisfaction ratings in 2011. Another important online survey source is the latest KD Nuggets report [9] (2011). Table 1 shows the results on ML tools usage [footnote 2] from both Rexer Analytics and KD Nuggets.

Footnote 2. The Rexer Analytics and KD Nuggets surveys are open, online surveys, so big players may use their channels to collect more votes or positive feedback for one tool or another. Although they do not mirror the market with 100% accuracy, it is very unlikely that important players were missed by these reports.

Table 1. ML tools usage

Rank  KD Nuggets [9]          Usage  Rexer Analytics [7]
1     R                       31%    R
2     Excel                   30%    SAS
3     RapidMiner              27%    IBM SPSS Statistics
4     KNIME                   22%    Weka
5     Weka                    15%    StatSoft Statistica
6     StatSoft Statistica     14%    RapidMiner
7     SAS                     13%    MATLAB
8     Rapid Analytics         10%    IBM SPSS Modeler
9     MATLAB                  10%    MS SQL Server
10    IBM SPSS Statistics     8%     SAS Enterprise Miner
11    IBM SPSS Modeler        7%     KNIME
12    SAS Enterprise Miner    6%     C4.5/C5/See5
13    Orange                  5%     Mathematica
14    MS SQL Server           5%     Minitab
15    Other free DM software  5%     Salford Systems

3 Properties considered in this survey

Back in 1998, Kurt Thearling [10] identified several challenges ahead of data mining software tools: database integration, automated model scoring, exporting models to other applications, business templates, effort knob, incorporating financial information, computed target columns, time-series data, use vs. view, and wizards. In line with the objectives and motivation of our survey, we consider two of these challenges and evaluate how they are implemented in the reviewed products.

Table 2. ML methods

Rank  Zheng (2010) [11]              Rexer Analytics (2011) [7]
1     C4.5 / Classification          Regression
2     K-means / Clustering           Decision trees
3     SVM / Statistical learning     Cluster analysis
4     Apriori / Association analysis Time series
5     EM / Statistical learning      Neural networks
6     Page rank / Link mining        Factor analysis
7     Adaboost / Ensemble learning   Text mining
8     KNN / Classification           Association rules
9     Naive Bayes / Classification   SVM

The effort knob refers, in general, to the feedback the system gives the end user upon changing or tuning various parameters of the algorithm in order to obtain a more accurate prediction model. This kind of tweaking may increase the processing time by orders of magnitude. The relationship between parameters and processing time is something a user should not have to care about; instead, the system should provide constant feedback on effort estimates, so that users can easily see how costly the operation is in terms of resources such as memory and processor time. Alternatively, users should be able to control the global behavior and resource consumption, with the system adjusting the parameters accordingly. For example, when the effort level is set to a low value, the system should produce a model quickly, doing the best it can in the limited amount of time.
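
The effort knob idea can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed products: the names `EffortPlan`, `plan_from_effort` and `fit_line`, the mapping from effort level to iteration count, and the use of an iteration budget as a stand-in for processing time are all assumptions made for illustration. A single effort value in [0, 1] is translated into concrete parameters, and an optional callback receives progress feedback at every step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class EffortPlan:
    """Hypothetical concrete settings derived from one effort level."""
    max_iterations: int
    learning_rate: float


def plan_from_effort(effort: float) -> EffortPlan:
    """Map an effort level in [0, 1] to parameters (illustrative mapping only)."""
    if not 0.0 <= effort <= 1.0:
        raise ValueError("effort must lie in [0, 1]")
    # Assumption: the iteration budget is a proxy for processing time.
    return EffortPlan(max_iterations=10 + int(effort * 990), learning_rate=0.05)


def mean_squared_error(points: List[Point], w: float, b: float) -> float:
    """Average squared residual of the line y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)


def fit_line(
    points: List[Point],
    effort: float,
    report: Optional[Callable[[int, int], None]] = None,
) -> Tuple[float, float]:
    """Fit y = w * x + b by gradient descent within the budget implied by effort.

    The user only chooses `effort`; the system picks the parameters and calls
    `report(step, total_steps)` so the cost of the run stays visible.
    """
    plan = plan_from_effort(effort)
    w, b = 0.0, 0.0
    n = len(points)
    for step in range(1, plan.max_iterations + 1):
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in points) / n
        w -= plan.learning_rate * grad_w
        b -= plan.learning_rate * grad_b
        if report is not None:
            report(step, plan.max_iterations)
    return w, b


if __name__ == "__main__":
    # Toy data lying exactly on y = 2x + 1.
    data = [(float(x), 2.0 * x + 1.0) for x in range(5)]
    for level in (0.0, 0.5, 1.0):
        w, b = fit_line(data, level)
        error = mean_squared_error(data, w, b)
        print(f"effort={level:.1f} w={w:.4f} b={b:.4f} mse={error:.2e}")
```

With a low effort level the sketch returns a rough model after only a few iterations, while a high level spends more iterations and reaches a smaller error. A production system would more plausibly budget wall-clock time and memory rather than iterations.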
