IBM Data Science Professional Certificate

What is Data Science?

Tools for Data Science

Agenda

The language of data science

Categories of data science

Open Source Tools for Data Science

Commercial Tools for Data Science

Cloud Based Tools for Data Science

Packages, API, Data sets and Model

Data Science Methodology

Python for Data Science and AI

Databases and SQL for Data Science

Data Analysis with Python

Data Visualization with Python

Machine Learning with Python

What is Data Science?

Tools for Data Science

Agenda

Overview of data science tools

Tasks a data scientist needs to perform

Top open-source and commercial tools for each task

How those tools overlap in functionality

Pros and Cons of each tool

How tools can address the whole data science pipeline

The language of data science

The language we use depends on the problems we are trying to solve and many other factors.

Roles in Data Science

1. Business Analyst

2. Database Engineer

3. Data Analyst

4. Data Engineer

5. Data Scientist

6. Research Scientist

7. Software Engineer

8. Statistician

9. Product Manager

10. Project Manager

Top 3 Languages for Data Science

Python

R

SQL

Python

What makes Python great

General Purpose

Large Standard Library

For Data Science

Scientific computing libraries like pandas, NumPy, SciPy and Matplotlib

For AI, it has PyTorch, TensorFlow, Keras and scikit-learn

Python can be used for Natural Language Processing (NLP), using the Natural Language Toolkit (NLTK)
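As a small illustration of the "large standard library" point, the sketch below summarizes a few measurements using only modules that ship with Python. The sample values and variable names are invented for illustration.

```python
import statistics
from collections import Counter

# Hypothetical sample data, invented for this sketch.
ages = [23, 35, 35, 41, 52]
roles = ["analyst", "engineer", "analyst", "scientist", "analyst"]

mean_age = statistics.mean(ages)      # arithmetic mean
median_age = statistics.median(ages)  # middle value of the sorted list
role_counts = Counter(roles)          # frequency of each label

print(mean_age, median_age, role_counts.most_common(1))
```

No third-party package is needed here; libraries such as pandas and NumPy become useful once the data grows beyond what plain lists handle comfortably.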

R

R is free software.

SQL

Categories of data science

Open Source Tools for Data Science

Data management tools

MySQL

PostgreSQL

MongoDB

CouchDB

Cassandra

Hadoop HDFS

Ceph

Elasticsearch

Data Integration and Transformation Tools (ETL)

Apache Airflow, originally created by Airbnb;

Kubeflow, which enables you to execute data science pipelines on top of Kubernetes;

Apache Kafka, which originated from LinkedIn;

Apache NiFi, which delivers a very nice visual editor;

Apache Spark SQL, which enables you to use ANSI SQL and scales up to compute clusters of 1000s of nodes;

Node-RED, which also provides a visual editor. Node-RED consumes so little in resources that it even runs on small devices like a Raspberry Pi.
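The extract-transform-load pattern that these tools automate at scale can be sketched in a few lines with Python's standard library. This only illustrates the pattern and is not the API of any tool listed above; the CSV content, the `readings` table, and the column names are hypothetical.

```python
import csv
import io
import sqlite3

# Extract: read rows from a CSV source (an in-memory string stands in for a file).
raw = io.StringIO("sensor,temp_f\nkitchen,68\ngarage,50\n")
rows = list(csv.DictReader(raw))

# Transform: convert Fahrenheit to Celsius and normalise sensor names.
cleaned = [
    (r["sensor"].upper(), round((float(r["temp_f"]) - 32) * 5 / 9, 1))
    for r in rows
]

# Load: write the cleaned rows into a SQLite table, then query it with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, temp_c REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", cleaned)
conn.commit()

warmest = conn.execute(
    "SELECT sensor, temp_c FROM readings ORDER BY temp_c DESC LIMIT 1"
).fetchone()
print(warmest)
```

A dedicated ETL tool adds what this sketch leaves out: scheduling, retries, monitoring, and distribution across many machines.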

Data Visualization Tools

Hue, which can create visualizations from SQL queries.

Kibana, a data exploration and visualization web application, is limited to Elasticsearch (the data provider).

Apache Superset is a data exploration and visualization web application.

Model Deployment

Apache PredictionIO currently only supports Apache Spark ML models for deployment, but support for all sorts of other libraries is on the roadmap.

Seldon is an interesting product since it supports nearly every framework, including TensorFlow, Apache Spark ML, R, and scikit-learn. Seldon can run on top of Kubernetes and Red Hat OpenShift.

MLeap is another way to deploy Spark ML models. Finally, TensorFlow can serve any of its models using TensorFlow Serving. You can deploy to an embedded device like a Raspberry Pi or a smartphone using TensorFlow Lite, and even deploy to a web browser using TensorFlow.js. Model monitoring is another crucial step.

Model Monitoring and Assessment

Code Asset Management

GitHub

Git

GitLab

Development Environment

Jupyter Notebook

JupyterLab

Apache Zeppelin

RStudio

Spyder

Execution Environment

Apache Spark

Apache Flink

Ray (from RISELab)

Fully Integrated and Visual Tools

KNIME

Orange

Commercial Tools for Data Science

Cloud Based Tools for Data Science

Packages, API, Data sets and Model

Libraries for Data Science

Python Libraries

Scientific computing Libraries in Python

Visualization Libraries in Python

High-level Libraries and Deep Learning

Deep Learning Libraries in Python.

Scientific Computing Libraries in Python

pandas: data structures & tools

NumPy: Arrays and Matrices
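A minimal sketch of what these two libraries offer, assuming both are installed: NumPy for array and matrix arithmetic, pandas for labelled tabular data. The numbers, city names, and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: a 2-D array (matrix) with vectorised arithmetic.
scores = np.array([[80, 90], [70, 60]])
column_means = scores.mean(axis=0)  # mean of each column

# pandas: a labelled table (DataFrame) with grouping and aggregation.
# The city names and sales figures are invented for this sketch.
df = pd.DataFrame({"city": ["Paris", "Rome", "Paris"], "sales": [10, 20, 30]})
totals = df.groupby("city")["sales"].sum()

print(column_means)
print(totals)
```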

Visualization Libraries in Python

Matplotlib: plots & graphs

Seaborn: heat maps, time series, violin plots
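As a rough illustration of the plotting workflow, the sketch below draws a simple line plot with Matplotlib only (Seaborn builds on the same figure and axes objects). It assumes Matplotlib is installed; the data and labels are invented, and the `Agg` backend is chosen so the code runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical data, invented for this sketch.
months = [1, 2, 3, 4]
visits = [120, 150, 90, 180]

fig, ax = plt.subplots()
ax.plot(months, visits, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Visits")
ax.set_title("Visits per month")

# Render to an in-memory PNG; pass a file path instead to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```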

Machine Learning & Deep Learning

Scikit-learn: regression, classification

Keras: Deep learning neural networks

TensorFlow: deep learning production and deployment

PyTorch: regression, classification

Apache Spark

Scala Libraries

Data Science Methodology

Python for Data Science and AI

Databases and SQL for Data Science

Data Analysis with Python

Data Visualization with Python

Machine Learning with Python