
Presented for the ICEAA Machine Learning Working Group, April 22, 2020 http://www.iceaaonline.com/mlgroup/

RapidMiner - Introduction

• RapidMiner is a company in Boston, MA, founded in 2006.
• Offers several products; the primary product is RapidMiner Studio.
• Offers community and paid editions.
• Gartner lists RapidMiner as a Visionary.
• Pros: large, active community; end-to-end life cycle management features; integration with other tools.
• Cons: market growth lags that of competitors, and it is potentially expensive.
• GUI-based: drag and drop components to build processes.

RapidMiner - Example

RapidMiner – Bryan’s Review

• Many concepts to know.
• Harder to pick up than Excel, but generally easier than implementing in a programming language.
• Great interactive tutorials guide you through many of the important features.
• Really shines with Auto Model. This feature creates several ML models automatically for fast(er) insights and model iterations.
• Model runs that would take months to generate manually took ~1 hr.
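The idea behind automatically creating and comparing several models can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how RapidMiner's Auto Model works internally; the toy data, the three candidate models, and all function names are assumptions made up for this sketch.

```python
from statistics import fmean

# Hypothetical toy data (not from the presentation): x is an input, y a target.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
test_x = [7, 8]
test_y = [14.1, 15.9]


def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    mean_y = fmean(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Ordinary least squares line through the training points."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_nearest(xs, ys):
    """1-nearest-neighbor: copy the target of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def rmse(model, xs, ys):
    """Root mean squared error of a fitted model on held-out data."""
    return fmean((model(x) - y) ** 2 for x, y in zip(xs, ys)) ** 0.5


candidates = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}

# Fit every candidate on the training data, score it on held-out data,
# and rank the candidates from lowest to highest error.
ranking = sorted(
    (rmse(fit(train_x, train_y), test_x, test_y), name)
    for name, fit in candidates.items()
)

for error, name in ranking:
    print(f"{name:8s} RMSE = {error:.3f}")
```

Adding another candidate only requires one more entry in `candidates`, which is what makes this style of comparison quick to iterate on.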

Python -

• Python was created in the early 1990s by Guido van Rossum and has evolved into a popular open source language.

• Designed for readability (see the Zen of Python – PEP 20)

• Characteristics:
  • 4th Generation language
  • Supported on many operating systems
  • Object-Oriented
  • Large online community and information

Python – Suggestion for Packages to Know

| Name | Level | Area | Comments |
|---|---|---|---|
| Pandas | New | Exploratory Analysis | 2D data structuring (similar to data frames) and many utility functions for quick insights. |
| Numpy | Intermediate | All Analysis | Provides core objects for several packages. |
| Scipy | Intermediate | Models/Algos | Statistical/probability and numerical features. |
| Scikit-learn | New | Models/Algos | Un/Supervised algorithms |
| Beautifulsoup | New | IO | Read XML and other formats. |
| Matplotlib | New | Reporting | Plot data |
| Seaborn | Intermediate | Reporting | More aesthetic plots |
| Requests | Intermediate | IO | Read HTTP data |
| Psycopg2 | Intermediate | IO | Read PostgreSQL data |
| Sqlalchemy | Advanced | IO | ORM for databases |
| Datetime | New | All Analysis | Datetime objects (w/ time zones) |
| Typing | Advanced | Reporting | In-source documentation |
| networkx | Intermediate | Models/Algos | Graph package |
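Two entries in the table, Datetime and Typing, ship with Python itself, so they can be tried without installing anything. The sketch below shows timezone-aware datetime objects alongside type hints used as in-source documentation. The function name `hours_between`, the fixed `EDT` offset, and the timestamps are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def hours_between(start: datetime, end: datetime) -> float:
    """Elapsed hours between two timezone-aware datetimes.

    The type hints document what the function expects and returns
    without changing how it runs.
    """
    return (end - start).total_seconds() / 3600


UTC = timezone.utc
# Fixed UTC-4 offset, used here for illustration only (no daylight-saving rules).
EDT = timezone(timedelta(hours=-4), "EDT")

start = datetime(2020, 4, 22, 13, 0, tzinfo=UTC)
end = datetime(2020, 4, 22, 11, 30, tzinfo=EDT)  # same instant as 15:30 UTC

elapsed = hours_between(start, end)
print(elapsed)  # 2.5
```

Because both values carry a time zone, the subtraction is done on the underlying instants, so mixing UTC and local timestamps gives the right answer.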

Python – ML Packages

| Name | Commits | Contributors | Notes |
|---|---|---|---|
| Keras | 5342 | 816 | Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. |
| Pytorch | 25857 | 1365 | PyTorch is an open source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing. |
| TensorFlow | 83868 | 2467 | TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications. |
| Xgboost | 4167 | 404 | XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. |
| Caffe | 4156 | 266 | Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by Berkeley AI Research (BAIR) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley. Caffe is released under the BSD 2-Clause license. |
| Theano | 28120 | 332 | Theano is a Python library that lets you define, optimize, and evaluate mathematical expressions, especially ones with multi-dimensional arrays (numpy.ndarray). Using Theano it is possible to attain speeds rivaling hand-crafted implementations for problems involving large amounts of data. It can also surpass C on a CPU by many orders of magnitude by taking advantage of recent GPUs. |
| FeatureTools | 530 | 42 | Featuretools is a python library for automated feature engineering. |
| Statsmodels | 12791 | 217 | statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. |
| Cntk | 16116 | 198 | The Microsoft Cognitive Toolkit (https://cntk.ai) is a unified deep learning toolkit that describes neural networks as a series of computational steps via a directed graph. |
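The gradient boosting idea named in the XGBoost row can be illustrated with a small, standard-library-only sketch: each round fits a one-split "stump" to the current residuals and adds a damped version of it to the prediction. This is not XGBoost's API or implementation (which is far more elaborate); the toy data and the function names `fit_stump`, `boost`, and `predict` are assumptions for illustration.

```python
from statistics import fmean


def fit_stump(xs, residuals):
    """Find the single threshold on x that best splits the residuals."""
    best = None
    # Skip the largest x so the right-hand side of a split is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean, right_mean = fmean(left), fmean(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]  # (threshold, left value, right value)


def boost(xs, ys, rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error using stumps as weak learners."""
    base = fmean(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((t, left_mean, right_mean))
        preds = [
            p + learning_rate * (left_mean if x <= t else right_mean)
            for x, p in zip(xs, preds)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and every damped stump contribution."""
    base, stumps, learning_rate = model
    return base + sum(
        learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps
    )


# Hypothetical step-shaped data: low values for x <= 4, high values after.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 1, 1, 5, 5, 5, 5]

model = boost(xs, ys)
baseline_mse = fmean((y - fmean(ys)) ** 2 for y in ys)
boosted_mse = fmean((y - predict(model, x)) ** 2 for x, y in zip(xs, ys))
print(f"baseline MSE = {baseline_mse:.3f}, boosted MSE = {boosted_mse:.5f}")
```

The small learning rate means each stump only corrects part of the remaining error, which is why many rounds are needed before the boosted error falls well below the baseline.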

Python – Backup

Python – Process Models for Data Analysis

Fig 1. PDCA cycle (Plan, Do, Check/Study, Act). From “PDCA”. Retrieved from https://en.wikipedia.org/wiki/PDCA

Fig 2. DMAIC cycle (Define, Measure, Analyze, Improve, Control). From “DMAIC”. Retrieved from https://en.wikipedia.org/wiki/DMAIC

Fig 3. Data science process: Prepare (Set Goals, Explore, Wrangle, Assess), Build (Plan, Analyze, Engineer, Optimize, Execute), Finish (Deliver, Revise, Wrap Up). Godsey, Brian. Think Like a Data Scientist: Tackle the data science process step-by-step, Manning Publications, 2017. O’Reilly, https://learning.oreilly.com/library/view/think-like-a/9781633430273/.

Fig 4. U.S. Government Accountability Office. (2009) GAO Cost Estimating And Assessment Guide (GAO-09-3SP). Retrieved from https://www.gao.gov/new.items/d093sp.pdf

Fig 5. From “The Data Science Process” by Springboard, 2016. Retrieved from https://www.kdnuggets.com/2016/03/data-science-process.html.

Fig 6. From “CRISP-DM”. Retrieved from https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining

Python - IDEs

| Name | Ease of Use | Cost | Features | Comments |
|---|---|---|---|---|
| PyCharm – Community | Medium | Free | Medium | Good out of the box with many features, but missing Scientific mode and other features. |
| PyCharm – Professional | Medium | Medium | High | One of the most popular IDEs. Very robust; it can probably do it if needed. |
| Spyder (Anaconda) | Medium | ? | Medium? | I’ve had a good experience with it. Geared for analysis rather than development. The only cons were that it was a new IDE (for me) and there was no easy way to set various key bindings. |
| Notepad++ | Easy | Free | Low-Medium | Old faithful, always needed for something. |
| Komodo IDE | ? | ? | ? | Never used, but it seems to have good potential. |
| Visual Studio Code | ? | ? | ? | Never used, but it seems to have good potential. |
| Atom | ? | ? | ? | Never used, but popular with developers. |
| Jupyter Notebook | Easy | Free | Medium | Not as many software engineering features as above, but one of the best for reporting and sharing. |
| | Hard | Free | Medium | Fun to use if you have time. |

Python – Package Mind Map

Python – Package Matrix
