Popular Data Science Tools

Angel Marchev 2.0
Kaloyan Haralampiev

What we did?

• Cooperation with communities and universities from Europe and Asia
• Students and experts with up to 20 years of experience
• 25 real cases in 3 Datathons
• 4 years
• Over 3000 members
• 63 superb solutions
• Working with SME and big companies
• More than 50 countries
• Area of NLP, data enrichment, computer vision and AI

What we dare to?

[Timeline figure, Sep 2014 to 2018. Labels: Our first event; Big data conference; First Datathon in CEE; Hack the News #Datathon; Hack the Fake News Datathon; The first online #Datathon2018; #Datathon2018 v2; Academia; Two projects with 30 volunteers]

• Over 50 meetups
• 8 conference participations
• 2 workshops

Past #Datathon2018

• 38 solutions
• 39 teams
• 9 cases
• 24 countries
• 32 mentors and industry experts
• 16563 messages exchanged
• 144 participants
• 493 chat rooms

• The #Datathon2018 participants managed to solve all cases
• There was great fun with more than 4 fun sessions
• A lot of beer and pizza was consumed
• 38 quality data solutions at the end
• Great results, challenging even for the companies

Impressions

“From all finalists we did see good and novel approach... also those who didn't arrive to finals, were also really close ... so good job to all teams!”
Tomislav Križan, CPO, Member of the Board

“The results of our case are impressive and have further motivated our R&D department to explore more opportunities and apply some of the team results that worked on it.”
Milena Yankova, Head of Research & Innovation

“Thank you all for this great weekend. It was a fantastic challenge and I am happy that I saw deep technical work from all the participants! I will be always here to support the DSS community”
Agamemnon Baltagiannis, Principal Data Scientist

“The best thing about this Datathon was its global footprint. I was amazed by the sheer enthusiasm that the participants demonstrated. The resilience and adaptability shown by a lot of them in providing a working solution to real life problems made this Datathon a huge success.”
Shashank Shekhar, Manager - Data Sciences

“The teams solutions were well documented in CRISP-DM Methodology at Datathon 2018 organized by DSS, in which Kaufland was proud to participate”

About The Tools

Introduction

• It is impossible to cover all tools, so we reduced the number of tools covered to the ones we use
• Still the task is hard, due to:
  – Various types of tools (noise in the input data)
  – Many criteria (so multi-dimensional problem)
  – Tools for many purposes (overlapping categories)
• Hmmmm..!? Sounds like an ideal case for Multi-dimensional scaling (MDS)
• SO LET’S GO FULL NERDY ON IT

MDS Map

Features:
• Application
  – Econometrics
• Workflow
  – Console
  – Menus
  – Nodes
  – Online
• License
  – Free
  – Non-free
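For the curious, here is a minimal sketch of how a map like this could be computed. It is not the procedure behind the slides: the feature sets are copied from the tool slides for a handful of tools, while the Jaccard distance, the SMACOF update, and the helper names (`jaccard_distance`, `dissimilarities`, `stress`, `mds`) are assumptions made for illustration.

```python
import math
import random

# Feature sets transcribed from the tool slides (application, interface, price).
# Which features were actually used for the MDS map is not stated; this
# encoding is an assumption.
TOOLS = {
    "Excel Data Analysis": {"Statistical analysis", "Menus and windows", "Licensed"},
    "PSPP": {"Statistical analysis", "Menus and windows", "Free"},
    "Python": {"Statistical analysis", "Econometric analysis", "Data mining",
               "Command console", "Free"},
    "R (+R studio)": {"Statistical analysis", "Econometric analysis", "Data mining",
                      "Command console", "Free"},
    "Rapid Miner": {"Statistical analysis", "Data mining",
                    "Graphical stream/workflow", "Licensed"},
    "KNIME": {"Statistical analysis", "Data mining",
              "Graphical stream/workflow", "Free"},
    "Google Colab": {"Data mining", "Online platform", "Free"},
}


def jaccard_distance(a, b):
    """0 for identical feature sets, 1 for sets with nothing in common."""
    return 1.0 - len(a & b) / len(a | b)


def dissimilarities(tools):
    """Return the tool names and the pairwise dissimilarity matrix."""
    names = list(tools)
    delta = [[jaccard_distance(tools[p], tools[q]) for q in names] for p in names]
    return names, delta


def stress(points, delta):
    """Raw stress: squared gap between map distances and dissimilarities."""
    n = len(points)
    return sum(
        (math.dist(points[i], points[j]) - delta[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


def mds(delta, dims=2, iterations=300, seed=0):
    """Metric MDS via SMACOF: repeated Guttman transforms with unit weights."""
    n = len(delta)
    rng = random.Random(seed)
    points = [[rng.uniform(-1.0, 1.0) for _ in range(dims)] for _ in range(n)]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            row = [0.0] * dims
            for j in range(n):
                if j == i:
                    continue
                d = math.dist(points[i], points[j])
                ratio = delta[i][j] / d if d > 1e-12 else 0.0
                for k in range(dims):
                    row[k] += ratio * (points[i][k] - points[j][k])
            updated.append([value / n for value in row])
        points = updated
    return points


names, delta = dissimilarities(TOOLS)
for name, (x, y) in zip(names, mds(delta)):
    print(f"{name:20s} {x:+.2f} {y:+.2f}")
```

Tools with similar feature sets should land close together on the resulting 2D map. The starting layout is random (fixed by `seed`), so the orientation of the map carries no meaning, only the relative distances do.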

[MDS map figure. Axis labels: Relatedness, Popularity, Interactivity. Legend: Free, Non-free. Clusters: “The User-Friendlies”, “The All-Stars”, “The Classics”, “The On-liners”]

The Classics

Excel Data Analysis

• Application: Statistical analysis
• Interface: Menus and windows
• Price: Licensed

• Pros: Availability (almost everybody has Excel)
• Cons: Works with selected cells, not with variable names

IBM SPSS Statistics

• Application: Statistical analysis; Econometric analysis
• Interface: Menus and windows; Command console
• Price: Licensed

• Pros: Very large set of analyses
• Cons: Non-interactive

PSPP

• Application: Statistical analysis
• Interface: Menus and windows
• Price: Free

• Pros: “Free” SPSS Statistics
• Cons: Relatively small set of analyses; Non-interactive

eViews

• Application: Econometric analysis
• Interface: Menus and windows; Command console
• Price: Licensed

• Pros: Efficient calculations
• Cons: Data import issues

Gretl

• Application: Econometric analysis
• Interface: Menus and windows; Command console
• Price: Free

• Pros: Hansl (localized user manual)
• Cons: Limit to the volume of data

The All-stars

Python

• Application: Statistical analysis; Econometric analysis; Data mining
• Interface: Command console
• Price: Free

• Pros: Global community developing libs
• Cons:

R (+R studio)

• Application: Statistical analysis; Econometric analysis; Data mining
• Interface: Command console
• Price: Free

• Pros: Global community developing libs
• Cons: A little weird language

Jupyter Notebook

• Application: Data mining
• Interface: Online platform
• Price: Free

• Pros: Industry standard for Data Science
• Cons:

MatLab

• Application: Statistical analysis; Econometric analysis
• Interface: Command console
• Price: Licensed

• Pros: Great documentation, parallel computing
• Cons: Expensive

The User-friendlies

JASP

• Application: Statistical analysis
• Interface: Menus and windows
• Price: Free

• Pros: Interactive
• Cons: Relatively small set of analyses

Weka

• Application: Statistical analysis; Data mining
• Interface: Graphical stream/workflow
• Price: Free

• Pros: One of the original revolutionaries
• Cons: Outdated and clumsy

Rapid Miner

• Application: Statistical analysis; Data mining
• Interface: Graphical stream/workflow
• Price: Licensed

• Pros: Probably the most intuitive interface
• Cons:

KNIME

• Application: Statistical analysis; Data mining
• Interface: Graphical stream/workflow
• Price: Free

• Pros: Interactive
• Cons: Relatively small set of analyses

Orange

• Application: Data mining
• Interface: Graphical stream/workflow
• Price: Free

• Pros: Interactive
• Cons: Relatively small set of analyses

IBM SPSS Modeler

• Application: Econometric analysis; Data mining
• Interface: Graphical stream/workflow
• Price: Licensed

• Pros: Utilizes resources well
• Cons: Not user-friendly when dealing with lots of features

MatLab Classification Learner

• Application: Data mining
• Interface: Graphical stream/workflow
• Price: Licensed

• Pros: Part of the Matlab environment
• Cons: Still under development to include more models

The On-liners

Microsoft Azure

• Application: Data mining
• Interface: Online platform
• Price: Licensed

• Pros: Many tools already available
• Cons: Could be a little hard to set up

IBM Watson Studio

• Application: Data mining
• Interface: Online platform
• Price: Licensed

• Pros: Brand new
• Cons: Still some compatibility issues

Amazon ML

• Application: Data mining
• Interface: Online platform
• Price: Licensed

• Pros: Integrated with AWS S3 and could work in real time
• Cons: Still under development to include more models

Google Colab

• Application: Data mining
• Interface: Online platform
• Price: Free

• Pros: GPU computation via TensorFlow
• Cons: Sessions limited to 12 hours at a time

Selecting the right tool

Selection tree

• What type of problem do you solve? (Application)
• What type of interface would be suitable? (Workflow)
• Licensed or non-licensed? (Price)

Application / Workflow / Price / Software

Statistical analysis
• Menus and windows
  – Licensed: Excel Data Analysis, IBM SPSS Statistics
  – Free: PSPP, JASP
• Command console
  – Licensed: MatLab, IBM SPSS Statistics
  – Free: R (+ R Studio), Python
• Graphical stream/workflow
  – Licensed: Rapid Miner
  – Free: KNIME, Weka

Econometric analysis
• Menus and windows
  – Licensed: eViews, IBM SPSS Statistics
  – Free: Gretl
• Command console
  – Licensed: eViews, IBM SPSS Statistics, MatLab
  – Free: Gretl, R (+ R Studio), Python
• Graphical stream/workflow
  – Licensed: IBM SPSS Modeler

Data mining
• Command console
  – Licensed: Matlab
  – Free: R (+ R Studio), Python
• Graphical stream/workflow
  – Licensed: IBM SPSS Modeler, Rapid Miner, Matlab Classification App
  – Free: Orange, KNIME, Weka
• Online platform
  – Licensed: IBM Watson Studio, Microsoft Azure, Amazon ML
  – Free: Google Colab, Jupyter Notebook

Q & A
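The selection tree can also be read as a nested lookup: answer the three questions in order and you arrive at a short list of tools. The sketch below transcribes the selection table into a Python dictionary; the names `SELECTION_TREE`, `select_tools` and `paths_to` are hypothetical and only serve the illustration.

```python
# Application -> interface -> price -> tools, transcribed from the selection table.
SELECTION_TREE = {
    "Statistical analysis": {
        "Menus and windows": {
            "Licensed": ["Excel Data Analysis", "IBM SPSS Statistics"],
            "Free": ["PSPP", "JASP"],
        },
        "Command console": {
            "Licensed": ["MatLab", "IBM SPSS Statistics"],
            "Free": ["R (+ R Studio)", "Python"],
        },
        "Graphical stream/workflow": {
            "Licensed": ["Rapid Miner"],
            "Free": ["KNIME", "Weka"],
        },
    },
    "Econometric analysis": {
        "Menus and windows": {
            "Licensed": ["eViews", "IBM SPSS Statistics"],
            "Free": ["Gretl"],
        },
        "Command console": {
            "Licensed": ["eViews", "IBM SPSS Statistics", "MatLab"],
            "Free": ["Gretl", "R (+ R Studio)", "Python"],
        },
        "Graphical stream/workflow": {
            "Licensed": ["IBM SPSS Modeler"],
        },
    },
    "Data mining": {
        "Command console": {
            "Licensed": ["Matlab"],
            "Free": ["R (+ R Studio)", "Python"],
        },
        "Graphical stream/workflow": {
            "Licensed": ["IBM SPSS Modeler", "Rapid Miner",
                         "Matlab Classification App"],
            "Free": ["Orange", "KNIME", "Weka"],
        },
        "Online platform": {
            "Licensed": ["IBM Watson Studio", "Microsoft Azure", "Amazon ML"],
            "Free": ["Google Colab", "Jupyter Notebook"],
        },
    },
}


def select_tools(application, interface, price):
    """Walk the tree; return an empty list when the branch does not exist."""
    return SELECTION_TREE.get(application, {}).get(interface, {}).get(price, [])


def paths_to(tool):
    """Reverse lookup: every (application, interface, price) branch listing a tool."""
    return [
        (application, interface, price)
        for application, interfaces in SELECTION_TREE.items()
        for interface, prices in interfaces.items()
        for price, tools in prices.items()
        if tool in tools
    ]


print(select_tools("Econometric analysis", "Menus and windows", "Free"))
print(paths_to("Google Colab"))
```

An empty result simply means the table has no entry for that combination, for example a free data mining tool with a menus-and-windows interface.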
