Data Science Primer Documentation

Data Science Primer Documentation Team Mar 19, 2019 Technical 1 Bash 3 2 Big Data 5 3 Databases 7 4 Data Engineering 11 5 Data Wrangling 13 6 Data Visualization 15 7 Deep Learning 17 8 Machine Learning 19 9 Python 21 10 Statistics 23 11 SQL 25 12 Glossary 27 13 Business 31 14 Ethics 33 15 Mastering The Data Science Interview 35 16 Learning How To Learn 43 17 Communication 47 18 Product 49 19 Stakeholder Management 51 20 Datasets 53 i 21 Libraries 69 22 Papers 99 23 Other Content 105 24 Contribute 111 ii Data Science Primer Documentation The primary purpose of this primer is to give a cursory overview of both technical and non-technical topics associated with Data Science. Typically educational resources focus on technical topics. However, in reality, mastering the non-technical topics can lead to even greater dividends in your professional capacity as a data scientist. Since the task of a Data Scientist at any company can be vastly different (Inference, Analytics, and Algorithms ac- cording to Airbnb), there’s no real roadmap to using this primer. Feel free to jump around to topics that interest you and which may align closer with your career trajectory. Ultimately this resource is for anyone who would like to get a quick overview of a topic that a data scientist may be expected to have proficiency in. It’s not meant to serve as a replacement for fundamental courses that contribute to a Data Science education, nor as a way to gain mastery in a topic. Warning: This document is under early stage development. If you find errors, please raise an issue or contribute a better definition! Technical 1 Data Science Primer Documentation 2 Technical CHAPTER 1 Bash • Introduction • Subj_1 1.1 Introduction xx 1.2 Subj_1 Code Some code: print("hello") References 3 Data Science Primer Documentation 4 Chapter 1. Bash CHAPTER 2 Big Data • Introduction • Subj_1 https://www.quora.com/What-is-a-Hadoop-ecosystem 2.1 Introduction The tooling and methodologies to deal with the phenonmon of big data was originally created in response to the exponential growth in data generation (TODO: INSERT quote about data generation). When talking about big data we mean (insert what we mean). We’ll be focusing exclusively on the Apache Hadoop Ecosystem, which is [insert something]. It contains many different libraries that help achieve different aims. 5 Data Science Primer Documentation Why It’s Important: While a data scientist may not specifically be involved with the specifics of managing big data and performing Extract, Transform and Load tasks, they’ll likely be responsible for querying sources where this data is stored at a minimum. Ideally, a data scientist should be familar with how this data is stored, how to query it, and how to transform it (e.g. Spark, Hadoop, Hive would be three key technologies to be familar with). 2.2 Subj_1 Code Some code: print("hello") References 6 Chapter 2. Big Data CHAPTER 3 Databases • Introduction • Relational – Types – Advantages – Disadvantages • Non-relational/NoSQL – Types – Advantages – Disadvantages • Terminology 3.1 Introduction Databases in the most basic terms, serve as a way of storing, organizing and retrieving information. The two main kinds of databases are Relational and Non-Relational (NoSQL). Within both kinds of databases are a variety of other database sub-types which the following sections will explore in detail. Why It’s Important: Often times a data scientist will have to work with a database in order to retrieve the data they need to build their models or to do data analysis. Thus it’s important that a scientist familiarize themselves with the different types of databases and how to query them. 7 Data Science Primer Documentation 3.2 Relational A relational database is a collection of tables and the relationships that exist between these tables. These tables contain columns (table attributes) with rows that represent the data within the table. You can think of the column as the header indicating the name of the data being stored, and the row being the actual data points that are stored in the database. Often times these rows will have a unique identifier associated with them, called primary keys. Relationships between tables are created using foreign keys, whose primary function is to join tables together. To get data out of a relational database, you would use SQL, with different relational databases using slightly different versions of SQL. When interacting with the database using SQL, whether creating a table or querying it, you are creating what is called a Transaction. Transactions in relational databases represent a unit of work performed against the database. An important concept with transactions in relational databases is maintaining ACID, which stands for: • Atomicity: A transaction typically contains many queries. Atomicity guarantees that if one query fails, then the whole transaction fails, leaving the database unchanged. • Consistency: This ensures that a transaction can only bring a database from one valid state to another, meaning a transaction cannot leave the database in a corrupt state. • Isolation: Since transactions are executed concurrently, this property ensures that the database is left in the same state as if the transactions were executed sequentially. • Durability: This property gurantees that once a transaction has been committed, its effects will remain regardless of any system failure. 3.2.1 Types While there are many different kinds of relational databases, the most popular are: • PostgreSQL: An open-source object-relational database management system that is ACID-compliant and trans- actional. • Oracle: A multi-model database management system that is produced and marketed by Oracle. • MySQL: An open-source relational database management system with a proprietary paid version available for additional functionality. • Microsoft SQL Server: A relational database management system developed by Microsoft. • Maria DB: A MySQL compabtable database engine forked from MySQL. 3.2.2 Advantages • SQL standards are well defined and commonly accepted • Easy to categorize and store data that can later be queried • Simple to understand, since the table structure and relationships are intuitive ato most users • Data integrity, through strong data typing and validity checks to make sure that data falls within an acceptable range 3.2.3 Disadvantages • Poor performance with unstructured data types due to schema and type constraints 8 Chapter 3. Databases Data Science Primer Documentation • Can be slow and not scalable compared to NoSQL • Unable to map certain kinds of data such as graphs 3.3 Non-relational/NoSQL Non-relational databases were developed from a need to deal with the exponential growth in data that was being gathered and processed. Additionally dealing with scalability, multi-structured data and geo-distribution were a few more reasons that NoSQL was created. There are a variety of NoSQL implementations, each with their own approach to tackling these problems. 3.3.1 Types • Key-value Store: One of the simpler NoSQL stores that works by assigning a value for each key. The database then uses a hash table to store unique keys and pointers for each data value. They can be used to store user session data, and examples of these types of databases include Redis and Amazon Dynamo. • Document Store: Similar to a key-value store, however in this case the value contains structured or semi- structured data and is referred to as a document. A use case for document store could be for a blogging platform. Examples of document stores include MongoDB and Apache CouchDB. • Column Store: In this implementation, data is stored in cells grouped in columns of data rather than rows of data, with each column being grouped into a column family. These can be used in content management systems, and examples of these types of databases include Cassandra and Apache Hbase. • Graph Store: These are based on the Entity - Attribute - Value model. Entities will have associated attributes and subsequent values when data is inserted. Nodes will store data about each entity, along with the relationships between nodes. Graph stores can be used in applications such as social networks, and examples of these types of databases include Neo4j and ArangoDB. 3.3.2 Advantages • High availability • Schema free or schema-on-read options • Ability to rapidly prototype applications • Elastic scalability • Can store massive amounts of data 3.3.3 Disadvantages • Since most NoSQL databases use eventual consistency instead of ACID, there may be a risk that data may be out of sync • Less support and maturity in the NoSQL ecosystem 3.4 Terminology Query A query can be thought of as a single action that is taken on a database 3.3. Non-relational/NoSQL 9 Data Science Primer Documentation Transaction A transaction is a sequence of queries that make up a single unit of work performed against a database. ACID Atomicity, Consistency, Isolation, Durability Schema A schema is the structure of a database Scalability Scalability when databases are concerned has to do with how databases handle an increase in transactions as well as data stored. The two main types are vertical scalability, which is concerned with adding more capacity to a single machine by adding additional RAM, CPU, etc. Horizontal scalability has to do with adding more machines and splitting the work amongst them. Normalization This is a technique of organizing tables within a relational database. It involves splitting up data into seperate tables to reduce redundancy and improve data integrity. Denormalization This is a technique of organizing tables within a relational database. It involves combining tables to reduce the number of JOIN queries. References • https://dzone.com/articles/the-types-of-modern-databases • https://www.mongodb.com/nosql-explained • https://www.channelfutures.com/cloud-2/the-limitations-of-nosql-database-storage-why-nosqls-not-perfect • https://opensourceforu.com/2017/05/different-types-nosql-databases/ 10 Chapter 3.

Data Science Primer Documentation

A Meaningful Mapping Approach for the Complex Design

Exploring the Role of Language in Two Systems for Categorization Kayleigh Ryherd University of Connecticut - Storrs, [email protected]

Naiad: a Timely Dataflow System

A Framework for Access and Use of Documentary Heritage at the National Archives of Zimbabwe

ML Cheatsheet Documentation

Claus-Christian Carbon

Neural Processing of Facial Expressions As Modulators of Communicative Intention Facial Expression in Communicative Intention

1.2 Le Zhang, Microsoft

Alibaba Amazon

Asana Is Marketed As a Project Management Software That Allows Collaborators to Work Together on Multiple Projects

Microsoft Recommenders Tools to Accelerate Developing Recommender Systems Scott Graham, Jun Ki Min, Tao Wu {Scott.Graham, Jun.Min, Tao.Wu} @ Microsoft.Com

Digitale Toolbox