Data Science Primer Documentation


Data Science Primer Documentation Team

Mar 19, 2019

Contents

Technical

1 Bash
2 Big Data
3 Databases
4 Data Engineering
5 Data Wrangling
6 Data Visualization
7 Deep Learning
8 Machine Learning
9 Python
10 Statistics
11 SQL
12 Glossary
13 Business
14 Ethics
15 Mastering The Data Science Interview
16 Learning How To Learn
17 Communication
18 Product
19 Stakeholder Management
20 Datasets
21 Libraries
22 Papers
23 Other Content
24 Contribute

The primary purpose of this primer is to give a cursory overview of both technical and non-technical topics associated with Data Science. Typically, educational resources focus on the technical topics; in reality, however, mastering the non-technical topics can pay even greater dividends in your professional capacity as a data scientist.

Since the task of a Data Scientist can vary widely from company to company (Inference, Analytics, and Algorithms, according to Airbnb), there's no single roadmap through this primer. Feel free to jump around to the topics that interest you and that align most closely with your career trajectory.

Ultimately this resource is for anyone who would like a quick overview of a topic that a data scientist may be expected to have proficiency in. It's not meant to serve as a replacement for the fundamental courses that contribute to a Data Science education, nor as a way to gain mastery of a topic.

Warning: This document is under early stage development. If you find errors, please raise an issue or contribute a better definition!

Technical

CHAPTER 1 Bash

• Introduction
• Subj_1

1.1 Introduction

xx

1.2 Subj_1

Code

Some code:

print("hello")

References

CHAPTER 2 Big Data

• Introduction
• Subj_1

https://www.quora.com/What-is-a-Hadoop-ecosystem

2.1 Introduction

The tooling and methodologies for dealing with the phenomenon of big data were originally created in response to the exponential growth in data generation (TODO: INSERT quote about data generation). When talking about big data we mean (insert what we mean). We'll be focusing exclusively on the Apache Hadoop Ecosystem, which is [insert something]. It contains many different libraries that help achieve different aims.

Why It's Important: While a data scientist may not be involved in the specifics of managing big data and performing Extract, Transform and Load (ETL) tasks, they'll likely be responsible, at a minimum, for querying the sources where this data is stored. Ideally, a data scientist should be familiar with how this data is stored, how to query it, and how to transform it (e.g. Spark, Hadoop, and Hive would be three key technologies to be familiar with). A minimal sketch of this kind of work follows.
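The hypothetical sketch below uses PySpark to load a file and run a simple aggregation, the sort of query-and-transform task described above. The file name and column names (events.csv, user_id, status) are illustrative assumptions, not something defined by this primer:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; in production this would point at a cluster.
spark = SparkSession.builder.appName("primer-sketch").getOrCreate()

# Load a CSV file into a DataFrame (the file and its columns are hypothetical).
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Query and transform: count events per user, keeping only active users.
per_user = (
    df.filter(F.col("status") == "active")
      .groupBy("user_id")
      .count()
      .orderBy(F.desc("count"))
)

per_user.show(10)  # display the ten most active users
spark.stop()

The same DataFrame code runs whether the session points at a laptop or at a Hadoop/YARN cluster, which is one reason Spark is a common entry point into the ecosystem.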
2.2 Subj_1

Code

Some code:

print("hello")

References

CHAPTER 3 Databases

• Introduction
• Relational
  – Types
  – Advantages
  – Disadvantages
• Non-relational/NoSQL
  – Types
  – Advantages
  – Disadvantages
• Terminology

3.1 Introduction

Databases, in the most basic terms, serve as a way of storing, organizing and retrieving information. The two main kinds of databases are relational and non-relational (NoSQL). Within both kinds are a variety of database sub-types, which the following sections explore in detail.

Why It's Important: Oftentimes a data scientist will have to work with a database in order to retrieve the data they need to build their models or to do data analysis. It's therefore important that a data scientist be familiar with the different types of databases and how to query them.

3.2 Relational

A relational database is a collection of tables and the relationships that exist between those tables. These tables contain columns (table attributes) and rows (the actual data points stored in the database); you can think of a column as the header naming the data being stored. Oftentimes rows will have a unique identifier associated with them, called a primary key. Relationships between tables are created using foreign keys, whose primary function is to join tables together.

To get data out of a relational database you use SQL, with different relational databases using slightly different dialects of SQL. When interacting with the database using SQL, whether creating a table or querying it, you are creating what is called a transaction. Transactions in relational databases represent a unit of work performed against the database. An important concept with transactions is maintaining ACID (illustrated in the sketch after this list), which stands for:

• Atomicity: A transaction typically contains many queries. Atomicity guarantees that if one query fails, the whole transaction fails, leaving the database unchanged.
• Consistency: This ensures that a transaction can only bring a database from one valid state to another, meaning a transaction cannot leave the database in a corrupt state.
• Isolation: Since transactions are executed concurrently, this property ensures that the database is left in the same state as if the transactions had been executed sequentially.
• Durability: This property guarantees that once a transaction has been committed, its effects will remain regardless of any system failure.
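A minimal sketch of atomicity, assuming nothing beyond Python's built-in sqlite3 module (the accounts table and its rows are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # opens a transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 30 "
                     "WHERE name = 'alice'")
        # This violates the primary-key constraint, so the whole
        # transaction fails and the UPDATE above is rolled back with it.
        conn.execute("INSERT INTO accounts VALUES ('alice', 0)")
except sqlite3.IntegrityError:
    pass

# Prints (100,): the failed transaction left no partial changes behind.
print(conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone())

Server databases such as PostgreSQL expose the same behaviour through explicit BEGIN, COMMIT and ROLLBACK statements rather than a Python context manager.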
3.2.1 Types

While there are many different kinds of relational databases, the most popular are:

• PostgreSQL: An open-source object-relational database management system that is ACID-compliant and transactional.
• Oracle: A multi-model database management system produced and marketed by Oracle.
• MySQL: An open-source relational database management system, with a proprietary paid version available for additional functionality.
• Microsoft SQL Server: A relational database management system developed by Microsoft.
• MariaDB: A MySQL-compatible database engine forked from MySQL.

3.2.2 Advantages

• SQL standards are well defined and commonly accepted
• Easy to categorize and store data that can later be queried
• Simple to understand, since the table structure and relationships are intuitive to most users
• Data integrity, through strong data typing and validity checks that ensure data falls within an acceptable range

3.2.3 Disadvantages

• Poor performance with unstructured data types, due to schema and type constraints
• Can be slow and hard to scale compared to NoSQL
• Unable to map certain kinds of data, such as graphs

3.3 Non-relational/NoSQL

Non-relational databases were developed out of a need to deal with the exponential growth in the data being gathered and processed. Scalability, multi-structured data and geo-distribution were a few more of the reasons NoSQL was created. There are a variety of NoSQL implementations, each with its own approach to tackling these problems.

3.3.1 Types

• Key-value Store: One of the simpler NoSQL stores, which works by assigning a value to each key. The database then uses a hash table to store unique keys and pointers to each data value. Key-value stores can be used to hold user session data; examples include Redis and Amazon Dynamo.
• Document Store: Similar to a key-value store, except that the value contains structured or semi-structured data and is referred to as a document. A document store could back a blogging platform, for example. Examples include MongoDB and Apache CouchDB.
• Column Store: In this implementation, data is stored in cells grouped in columns rather than rows, with each column belonging to a column family. Column stores can be used in content management systems; examples include Cassandra and Apache HBase.
• Graph Store: These are based on the Entity-Attribute-Value model. Entities have associated attributes and values when data is inserted, and nodes store the data about each entity along with the relationships between nodes. Graph stores suit applications such as social networks; examples include Neo4j and ArangoDB.

3.3.2 Advantages

• High availability
• Schema-free or schema-on-read options
• Ability to rapidly prototype applications
• Elastic scalability
• Can store massive amounts of data

3.3.3 Disadvantages

• Since most NoSQL databases use eventual consistency instead of ACID, there is a risk that data may be out of sync
• Less support and maturity in the NoSQL ecosystem

3.4 Terminology

Query: A query can be thought of as a single action taken against a database.

Transaction: A transaction is a sequence of queries that make up a single unit of work performed against a database.

ACID: Atomicity, Consistency, Isolation, Durability.

Schema: A schema is the structure of a database.

Scalability: Scalability, where databases are concerned, has to do with how a database handles growth in transactions as well as in stored data. Vertical scalability adds more capacity to a single machine (additional RAM, CPU, etc.); horizontal scalability adds more machines and splits the work amongst them.

Normalization: A technique for organizing tables within a relational database. It involves splitting data into separate tables to reduce redundancy and improve data integrity.

Denormalization: The reverse technique, combining tables to reduce the number of JOIN queries. Both are illustrated in the sketch at the end of this chapter.

References

• https://dzone.com/articles/the-types-of-modern-databases
• https://www.mongodb.com/nosql-explained
• https://www.channelfutures.com/cloud-2/the-limitations-of-nosql-database-storage-why-nosqls-not-perfect
• https://opensourceforu.com/2017/05/different-types-nosql-databases/
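To make the last two terminology entries concrete, here is a minimal sketch, again using Python's built-in sqlite3 module with hypothetical customers/orders data, of the same information stored denormalized in one table versus normalized into two:

import sqlite3

conn = sqlite3.connect(":memory:")

# Denormalized: customer details repeat on every order row, so they are
# redundant, and an address change must touch many rows.
conn.execute("""CREATE TABLE orders_denorm (
    order_id INTEGER PRIMARY KEY, customer_name TEXT,
    customer_city TEXT, item TEXT)""")

# Normalized: customer details live in exactly one place, and orders
# reference them through a foreign key.
conn.execute("""CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)""")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id), item TEXT)""")

conn.execute("INSERT INTO customers VALUES (1, 'alice', 'Berlin')")
conn.executemany("INSERT INTO orders VALUES (?, 1, ?)",
                 [(1, "book"), (2, "lamp")])

# Reading the normalized form back requires the JOIN that
# denormalization trades away.
for row in conn.execute("""SELECT o.order_id, c.name, c.city, o.item
                           FROM orders o
                           JOIN customers c ON o.customer_id = c.customer_id"""):
    print(row)  # (1, 'alice', 'Berlin', 'book') then (2, 'alice', 'Berlin', 'lamp')

In the denormalized table, those two order rows would each carry their own copy of alice's name and city, which is exactly the redundancy that normalization removes.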