IBM Data Science Professional Certificate
What is Data Science?
Tools for Data Science
  Agenda
  The language of data science
  Categories of data science
  Open Source Tools for Data Science
  Commercial Tools for Data Science
  Cloud Based Tools for Data Science
  Packages, APIs, Data Sets and Models
Data Science Methodology
Python for Data Science and AI
Databases and SQL for Data Science
Data Analysis with Python
Data Visualization with Python
Machine Learning with Python

What is Data Science?

Tools for Data Science

Agenda
  Overview of data science tooling
  Tasks a data scientist needs to perform
  Top open source and commercial tools for each task
  How those tools overlap in functionality
  Pros and cons of each tool
  How tools can address the whole data science pipeline

The language of data science
  The language we use depends on the problems we are trying to solve and on many other factors.

  Roles in data science
    1. Business Analyst
    2. Database Engineer
    3. Data Analyst
    4. Data Engineer
    5. Data Scientist
    6. Research Scientist
    7. Software Engineer
    8. Statistician
    9. Product Manager
    10. Project Manager

  Top 3 languages for data science
    Python
    R
    SQL

  Python
    What makes Python great
      General purpose
      Large standard library
    For data science: scientific computing libraries such as Pandas, NumPy, SciPy and Matplotlib
    For artificial intelligence: PyTorch, TensorFlow, Keras and scikit-learn
    For natural language processing (NLP): the Natural Language Toolkit (NLTK)

  R
    R is free software.

  SQL

Categories of data science

Open Source Tools for Data Science

  Data management tools
    MySQL
    PostgreSQL
    MongoDB
    CouchDB
    Cassandra
    Hadoop HDFS
    Ceph
    Elasticsearch

  Data integration and transformation tools (ETL)
    Apache Airflow, originally created by Airbnb.
    Kubeflow, which enables you to execute data science pipelines on top of Kubernetes.
    Apache Kafka, which originated at LinkedIn.
    Apache NiFi, which provides a visual editor.
    Apache Spark SQL, which lets you use ANSI SQL and scales up to compute clusters of thousands of nodes.
    Node-RED, which also provides a visual editor. Node-RED consumes so few resources that it even runs on small devices like a Raspberry Pi.

  Data visualization tools
    Hue, which can create visualizations from SQL queries.
    Kibana, a data exploration and visualization web application, limited to Elasticsearch as its data provider.
    Apache Superset, a data exploration and visualization web application.

  Model deployment
    Apache PredictionIO currently supports only Apache Spark ML models for deployment, but support for other libraries is on the roadmap.
    Seldon supports nearly every framework, including TensorFlow, Apache SparkML, R and scikit-learn. Seldon can run on top of Kubernetes and Red Hat OpenShift.
    MLeap is another way to deploy SparkML models.
    Finally, TensorFlow can serve any of its models using TensorFlow Serving.
    You can deploy to an embedded device like a Raspberry Pi or a smartphone using TensorFlow Lite, and even deploy to a web browser using TensorFlow.js.
    Model monitoring is another crucial step.

  Model monitoring and assessment

  Code asset management
    Git
    GitHub
    GitLab

  Development environments
    Jupyter Notebook
    JupyterLab
    Apache Zeppelin
    RStudio
    Spyder

  Execution environments
    Apache Spark
    Apache Flink
    Ray (from RISELab)

  Fully integrated and visual tools
    KNIME
    Orange

Commercial Tools for Data Science

Cloud Based Tools for Data Science

Packages, APIs, Data Sets and Models

  Libraries for data science
    Python libraries
      Scientific computing libraries in Python
      Visualization libraries in Python
      High-level machine learning and deep learning libraries
      Deep learning libraries in Python

  Scientific computing libraries in Python
    Pandas: data structures and tools
    NumPy: arrays and matrices

  Visualization libraries in Python
    Matplotlib: plots and graphs
    Seaborn: heat maps, time series, violin plots

  Machine learning and deep learning
    Scikit-learn: regression, classification
    Keras: deep learning neural networks
    TensorFlow: deep learning production and deployment
    PyTorch: regression, classification

  Apache Spark
  Scala libraries

Data Science Methodology
Python for Data Science and AI
Databases and SQL for Data Science
Data Analysis with Python
Data Visualization with Python
Machine Learning with Python
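
As a small illustration of the scientific computing libraries named under Packages, API, Data sets and Model, the sketch below uses NumPy for a matrix-vector product and Pandas for a labelled table. It is a minimal example and not part of the original slides: the variable names and the toy table are assumptions made for illustration, with the tool and category names taken from the open source tools listed earlier.

```python
import numpy as np
import pandas as pd

# NumPy: arrays and matrices.
# Multiply a 2x2 matrix by a length-2 vector.
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
vector = np.array([1.0, 1.0])
product = matrix @ vector  # row sums of the matrix: [3.0, 7.0]

# Pandas: data structures and tools.
# Hypothetical toy table pairing a few of the tools named in the slides
# with the category under which the slides list them.
tools = pd.DataFrame({
    "tool": ["MySQL", "Hue", "Kibana", "Seldon"],
    "category": [
        "data management",
        "data visualization",
        "data visualization",
        "model deployment",
    ],
})

# Count how many of the sampled tools fall into each category.
counts = tools.groupby("category")["tool"].count()

print(product)
print(counts)
```

The same pattern (build an array or DataFrame, then apply a vectorized or grouped operation) is typically the starting point before handing data to the visualization or machine learning libraries listed above.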