IBM Data Science Professional Certificate
What is Data Science?
Tools for Data Science
Agenda
The language of data science
Categories of data science
Open Source Tools for Data Science
Commercial Tools for Data Science
Cloud Based Tools for Data Science
Packages, APIs, Data Sets and Models
Data Science Methodology
Python for Data Science and AI
Databases and SQL for Data Science
Data Analysis with Python
Data Visualization with Python
Machine Learning with Python
What is Data Science?
Tools for Data Science
Agenda
Overview of data science tooling
Tasks a data scientist needs to perform
Top open source and commercial tools for each task
How those tools overlap in functionality
Pros and cons of each tool
How tools can address the whole data science pipeline
The language of data science
The language we use depends on the problems we are trying to solve and on many other factors, such as the role we work in.
Roles in Data Science
1. Business Analyst
2. Database Engineer
3. Data Analyst
4. Data Engineer
5. Data Scientist
6. Research Scientist
7. Software Engineer
8. Statistician
9. Product Manager
10. Project Manager
Top 3 Languages for Data Science
Python
R
SQL
Python
What makes Python great
General Purpose
Large Standard Library
For Data Science
Scientific computing libraries like pandas, NumPy, SciPy and Matplotlib
For Artificial Intelligence it has PyTorch, TensorFlow, Keras and scikit-learn
Python can be used for Natural Language Processing (NLP), using the Natural Language Toolkit (NLTK)
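To illustrate the "Large Standard Library" point above, here is a minimal sketch that parses a small CSV and computes summary statistics using only modules that ship with Python. The sample data, column names and function names are hypothetical, chosen for illustration; for larger data sets the scientific libraries listed above would normally be the better choice.

```python
import csv
import io
import statistics

# Hypothetical sample data, inlined so the sketch is self-contained.
RAW = """city,temp_c
Oslo,4.5
Cairo,29.0
Lima,19.5
Pune,27.0
"""

def load_temperatures(text):
    """Parse CSV text into a list of (city, temperature) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["city"], float(row["temp_c"])) for row in reader]

def summarize(rows):
    """Return basic descriptive statistics for the temperature column."""
    temps = [temp for _, temp in rows]
    return {
        "count": len(temps),
        "mean": statistics.mean(temps),
        "median": statistics.median(temps),
        "max_city": max(rows, key=lambda r: r[1])[0],
    }

rows = load_temperatures(RAW)
summary = summarize(rows)
print(summary)
```

The same task in pandas would be shorter, but this version runs on any Python installation without extra packages.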
R
R is free software.
SQL
Categories of data science
Open Source Tools for Data Science
Data management tools
MySQL
PostgreSQL
MongoDB
CouchDB
Cassandra
Hadoop HDFS
Ceph
Elasticsearch
Data Integration and Transformation Tools (ETL)
Apache Airflow, originally created by Airbnb;
Kubeflow, which enables you to execute data science pipelines on top of Kubernetes;
Apache Kafka, which originated at LinkedIn;
Apache NiFi, which delivers a visual editor;
Apache Spark SQL, which enables you to use ANSI SQL and scales up to compute clusters of 1000s of nodes;
Node-RED, which also provides a visual editor. Node-RED consumes so few resources that it even runs on small devices like a Raspberry Pi.
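The tools above all automate some form of extract, transform, load (ETL). As a rough sketch of what those three steps mean, the example below uses only the Python standard library (`csv` and `sqlite3`), not any of the listed tools. The input data, table name and function names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read from files, APIs or queues.
RAW_ORDERS = """order_id,customer,amount
1, alice ,10.50
2,BOB,n/a
3,alice,4.50
4,Bob,20.00
"""

def extract(text):
    """Extract: read raw CSV rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize customer names and drop rows with non-numeric amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        clean.append((int(row["order_id"]), row["customer"].strip().lower(), amount))
    return clean

def load(conn, rows):
    """Load: write the cleaned rows into a SQL table."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_ORDERS)))

# Once loaded, the data can be queried with ordinary SQL.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Dedicated tools such as Airflow or NiFi add scheduling, monitoring and scaling around steps like these; the sketch only shows the data flow itself.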
Data Visualization Tools
Hue, which can create visualizations from SQL queries.
Kibana, a data exploration and visualization web application, is limited to Elasticsearch (the data provider).
Apache Superset is a data exploration and visualization web application.
Model Deployment
Apache PredictionIO currently only supports Apache Spark ML models for deployment, but support for all sorts of other libraries is on the roadmap.
Seldon is an interesting product since it supports nearly every framework, including TensorFlow, Apache Spark ML, R, and scikit-learn. Seldon can run on top of Kubernetes and Red Hat OpenShift.
MLeap is another way to deploy Spark ML models.
TensorFlow can serve any of its models using TensorFlow Serving. You can deploy to an embedded device like a Raspberry Pi or a smartphone using TensorFlow Lite, and even deploy to a web browser using TensorFlow.js.
Model monitoring is another crucial step.
Model Monitoring and Assessment
Code Asset Management
GitHub
Git
GitLab
Development Environment
Jupyter Notebook
JupyterLab
Apache Zeppelin
RStudio
Spyder
Execution Environment
Apache Spark
Apache Flink
Ray (from RISELab)
Fully Integrated and Visual Tools
KNIME
Orange
Commercial Tools for Data Science
Cloud Based Tools for Data Science
Packages, APIs, Data Sets and Models
Libraries for Data Science
Python Libraries
Scientific Computing Libraries in Python
Visualization Libraries in Python
High-Level Machine Learning and Deep Learning Libraries in Python
Deep Learning Libraries in Python
Scientific Computing Libraries in Python
Pandas: data structures and tools
NumPy: arrays and matrices
Visualization Libraries in Python
Matplotlib: plots and graphs
Seaborn: heat maps, time series, violin plots
Machine Learning & Deep Learning
Scikit-learn: regression, classification
Keras: deep learning neural networks
TensorFlow: deep learning production and deployment
PyTorch: regression, classification
Apache Spark
Scala Libraries
Data Science Methodology
Python for Data Science and AI
Databases and SQL for Data Science
Data Analysis with Python
Data Visualization with Python
Machine Learning with Python