Data Science Research

Data Science Fest 22.11.2019

Faculty of Science, Data Science Fest 22.11.2019

TODAY’S PROGRAMME

1. Jukka K. Nurminen: Software Systems for Data-Intensive Computing

2. Keijo Heljanko: Challenges in Parallel and Distributed Data Science

3. Tong Li: Long-term Mobile App Usage

4. Ghazaleh Kia: Spatiotemporal Data Analysis for Sustainability Science

5. Jussi Kangasharju: Collaborative Networking

6. Indrė Žliobaitė: Data Science and Evolution

7. Antti Koskela: Probabilistic Inference, Privacy and Computational Biology

8. Qingsong Guo: DB meets AI - Towards Autonomous Database Systems

9. Giulio Jacucci: Ubiquitous Interaction

10. Leo Leppänen: News Automation

11. Dorota Glowacka / Alan Medlar: Exploratory Search and Personalisation

12. Kai Puolamäki: Exploratory Data Analysis

13. Jussi Määttä: MachQu - Probabilistic Machine Learning for Material Design

14. Jörg Tiedemann: Language technology / NLP

15. Antti Ukkonen: Machine learning, language, and online conversations

16. Sasu Tarkoma: Content-centric structures and networking

17. Timo Tenhovuori: Data Science in Nokia

Jukka K Nurminen
Professor of Software Systems for Data-Intensive Computing (in Natural Sciences)

• My research focus is on the engineering and operation of AI systems
• Testing, maintenance, and other software lifecycle steps for AI systems
• Sustainable machine learning platforms
  ‒ Energy-efficiency
  ‒ New opportunities, especially quantum computing
• Methods and tools to make AI ethics useful for developers
• For me computer science is about solving problems
• New and relevant challenges are always welcome
• Experience in industrial research (Nokia), applied research (VTT), and academic research (Aalto)

Email: [email protected]
Publications: https://scholar.google.fi/citations?user=cmftgn4AAAAJ
LinkedIn: https://www.linkedin.com/in/jukka-k-nurminen/

Department of Computer Science

WHO IS ELAINE HERZBERG?

Software Life-Cycle Costs - Schach 2002

Challenges in Parallel and Distributed Data Science

Keijo Heljanko ‹keijo.heljanko@helsinki.fi›
Nov 22, 2019

University of Helsinki
Department of Computer Science
Helsinki Center for Data Science (HiDATA)

Big Data

• A Seagate-sponsored IDC study estimates the Global Datasphere at 33 Zettabytes (33 000 000 000 TB) in 2018, growing to 175 Zettabytes by 2025
• Data comes from: video, sensor data, Internet sites, social media, AI applications, healthcare, . . .
• For example, Netflix collects 1 PB of data per month on user behaviour in its video service; its total data warehouse is 60+ PB
• According to Cisco, Internet traffic volume is growing 30% per year, while Big Data storage is growing 51% per year

End of 40 Years of Sequential Computing Improvements

After 40 years of growth, the performance increase of general-purpose sequential computing has slowed to 3% per year [Hennessy & Patterson ’18].

Implications of the End of Free Lunch
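To put the growth figures quoted in these slides side by side, the following is a small worked calculation rather than anything taken from the talk itself. The 10-year comparison horizon and all function names are assumptions chosen for illustration; the rates (3%, 30%, 51%) and the 33 ZB and 175 ZB estimates are the ones given on the slides.

```python
# Worked arithmetic for the growth figures quoted on the slides.
# Function names and the 10-year horizon are illustrative assumptions.


def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Compound annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


def compound(rate: float, years: int) -> float:
    """Growth factor after `years` years at a fixed annual `rate`."""
    return (1.0 + rate) ** years


# Global Datasphere: 33 ZB in 2018 growing to 175 ZB by 2025 (IDC estimate).
datasphere_rate = implied_annual_growth(33.0, 175.0, 2025 - 2018)

HORIZON_YEARS = 10  # assumed comparison horizon, not from the slides
sequential_perf = compound(0.03, HORIZON_YEARS)  # sequential performance, 3%/year
traffic = compound(0.30, HORIZON_YEARS)          # Internet traffic, 30%/year
storage = compound(0.51, HORIZON_YEARS)          # Big Data storage, 51%/year

if __name__ == "__main__":
    print(f"Implied datasphere growth: {datasphere_rate:.1%} per year")
    print(f"Growth factors over {HORIZON_YEARS} years:")
    print(f"  sequential performance: x{sequential_perf:.2f}")
    print(f"  Internet traffic:       x{traffic:.2f}")
    print(f"  Big Data storage:       x{storage:.2f}")
```

The IDC estimate implies roughly 27% growth per year, so under these assumptions the gap between data growth and sequential performance widens every year unless the extra work is absorbed by parallelism.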

• Data volumes and application requirements are growing much faster than microprocessor performance
• The number of transistors and cores in a microprocessor is still growing at a high rate
• Programming models need to change to efficiently exploit all the available parallelism
• Programs should be made more parallel every year to be able to cope with data volume growth
• The future challenge for parallel and distributed systems is to come up with programming models that scale to the massively growing data volumes

Research Topics in 2019

• Distribution and Parallelization: Pan-Genomics pipelines with Apache Spark
• Applied Machine Learning: Prediction of Wet-end Breaks in a Paper Machine, Prediction of events in business processes
• Automated Verification: Concurrency bug finding tool for Linux kernel developers
• Design of Distributed Systems: Kubernetes based architecture for IoT Edge analytics
• Theory of Distributed Systems: Tool for proving parameterized systems correct

"What Apps Did You Use?": Understanding the Long-term Evolution of Mobile App Usage

Tong LI

22-11-2019

Long-term Mobile App Usage Dataset

App attributes:   User ID, apps, time zone, timestamp, mobile network type
Date:             01/2012 - 12/2017
Area:             Worldwide (over 80 countries)
# of users:       1,465
# of records:     12,457,867
# of apps:        110,932
# of categories:  32

The evolution of app-category usage

[Figure panels: Development of mobile networks; Correlation of app categories]

The evolution of app usage

Pareto effect

[Figure panels: Correlation of apps in the ‘News and magazines’ category across different years; Correlation of apps in the ‘Social’ category across different years]

Summary

The long-term usage of app categories and of individual apps evolves differently. The usage of an app category goes through two stages: a growth stage and a plateau stage. Individual apps go through the same two stages plus a third one, an elimination stage.

Technological development triggers the growth stage in both app categories and apps. This growth is not affected by the maturity of app categories or by the Pareto effect.

Fierce competition among apps within a category leads to the elimination stage of app usage and to decreasing correlations between apps in the same category.

SPATIOTEMPORAL DATA ANALYSIS FOR SUSTAINABILITY SCIENCE

Laura Ruotsalainen, Associate Professor
Ghazaleh Kia, PhD Student
Titti Malmivirta, Research Assistant
Sergey Nikolskiy, co-supervised PhD Student

18/11/2019

GROUP’S RESEARCH FOCUS

• Navigation in challenging environments (urban, indoors, Arctic)
• Autonomous systems (Road transport, UAVs)
• Computer vision methods for navigation and situational awareness
• Mitigation of intentional interference of satellite positioning

NAVIGATION IN CHALLENGING SITUATIONS

• Statistical Error Modelling
• Recursive Bayesian Estimation
• Measurement fusion
• Cooperative positioning
• Machine Learning
• Recognizing environment and motion
• Route prediction
• Improving computer vision / radio signal processing

AUTONOMOUS TRAFFIC

• Traffic deaths: 1.35 M per year globally, 10 M injured or disabled
• 93 % of accidents caused by or contributed to by driver error
• Greenhouse gas emissions
• Drones: 20 000 over city / hour / 2035 (EU)
• Pedestrians / Bicycles

Indoor Navigation

60˚ 10 1.2 N, 24˚ 57 18 E

Collaborative Networking
Prof. Jussi Kangasharju
University of Helsinki

Research Focus Areas

• Edge and cloud computing
• Combining AI and networking
• Information-centric networking (ICN)
• Internet of Things (IoT)
• Green networking

Edge Computing

• Cloud computing centralizes all data and computation
• Increased data amounts (big data, IoT) make this inefficient
  • Too much network traffic and processing load at the central cloud
  • Much of the data is only of local interest -> no benefit in sending it to the central cloud
• Edge and fog computing move processing towards the edge of the network
  • Lower latencies for information access and processing
  • Autonomous driving, AR/VR, etc.
  • Less wide-area network traffic
  • Easier to ensure privacy of user data due to limited scope of usage
• Edge computing is a key feature of 5G
• Our current research topics:
  • Discovery of edge servers
  • Edge server placement
  • Flexible computations

Intelligent Containers (ICON)

• Autonomous computing entities
• Providing services to users
• ICON swarm observes environment
• Migrate or replicate closer to users
• Application owner tunes behavior with latency and budget knobs

Information-Centric Networking (ICN)

• ICN evolves the Internet from a host-centric paradigm to an information-centric view of the network (named information)
• Supports intermittent connectivity, user mobility, multicast
• Key features: Name-based routing, universal caching, built-in security
• Our current research topics:
  • How to discover cached content in a distributed environment?
  • How to handle mobility of content producers and consumers?
  • Using ICN for managing distributed data and computation in IoT

Contact Information

• Website: http://www.helsinki.fi/collaborative-networking
• Email: [email protected]
• Twitter: @kangasharju

Group leader: Indrė Žliobaitė, Assistant professor, [email protected]

Analyzing changing world

• Analyzing changes in nature and society
• Computational methods for evolving data, change detection
• Analyzing and interpreting the global fossil record
• Transparency and accountability in machine learning

Art: Mauricio Anton, Ika Osterblad

Concept drift

Model does not change

Process changes

Source: Evonik Industries

Source: https://www.genetec.com/solutions/all-products/traffic-sense

Evolutionary palaeontology

Analyzing the global fossil record

Looking at changes in the past (hopefully) helps to understand what's coming in the future and how new emerging ecosystems will work

Climate change, societal change and sustainability

novel ecosystems

Photo credit: Markku Ulander, Kayle Reed

How to evaluate predictive models that adapt?

Do species age?

What does evolution optimize for?

How do we make robust and reliable models?

How do faunal communities rise and fall?

Joint work with many interdisciplinary collaborators

Funding

PROBABILISTIC INFERENCE, PRIVACY AND COMPUTATIONAL BIOLOGY

Group leader: Antti Honkela, associate professor

Department of Computer Science Finnish Center for Artificial Intelligence FCAI

RESEARCH AND TEACHING TOPICS

• Research:
  • Efficient methods for Bayesian inference
  • Differentially private machine learning and Bayesian inference
  • Genome sequencing data analysis and modelling
• Teaching:
  • Computational Statistics, period I, every year
  • Trustworthy Machine Learning (with I. Žliobaitė), period II, every year
  • Seminar: Machine Learning with Distributed Data, spring 2020

Privacy-preserving machine learning with differential privacy (DP)

• Results should not change too much, even if one person's data changes

Definition: Differential privacy (Dwork et al., 2006)

An algorithm $\mathcal{M}$ operating on a data set $\mathcal{D}$ is said to be $\epsilon$-differentially private ($\epsilon$-DP) if for any two data sets $\mathcal{D}$ and $\mathcal{D}'$, differing only by one sample, the probabilities of obtaining a result in any set $S$ fulfil

$$\frac{\Pr(\mathcal{M}(\mathcal{D}) \in S)}{\Pr(\mathcal{M}(\mathcal{D}') \in S)} \le e^{\epsilon}.$$

• Provides protection against adversaries with side information
• Is invariant to post-processing
• Degrades gracefully under composition

OUR WORK IN DIFFERENTIALLY PRIVATE ML

• Algorithms for DP machine learning and Bayesian inference
• DP learning with distributed data
• DP data anonymisation
• Applications, e.g. drug sensitivity prediction

Figure: reproducing statistical discoveries from strongly anonymised data
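As a concrete instance of the ε-DP definition from Dwork et al. (2006) cited in this talk, the sketch below uses the classic Laplace mechanism on a counting query. It is a textbook illustration and not the group's own algorithms; the function names and the toy data are assumptions, and the floating-point sampling is not hardened for production use.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Counting query released with the Laplace mechanism.

    Changing one sample changes the true count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon gives an epsilon-DP release.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def laplace_density(x: float, mean: float, scale: float) -> float:
    """Density of the Laplace distribution, used to check the DP ratio."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [23, 35, 41, 29, 52, 38, 47]  # hypothetical toy data
    epsilon = 0.5
    print(private_count(ages, lambda a: a >= 40, epsilon, rng))

    # Neighbouring data sets have true counts c and c + 1; for every output x
    # the density ratio stays below exp(epsilon), as the definition requires.
    for x in (-5.0, 2.5, 3.0, 3.5, 4.0, 30.0):
        ratio = laplace_density(x, 3.0, 1 / epsilon) / laplace_density(x, 4.0, 1 / epsilon)
        print(x, ratio <= math.exp(epsilon) + 1e-12)
```

A smaller ε means a larger noise scale, which is the privacy and accuracy trade-off behind the "degrades gracefully under composition" property listed in the talk.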

DB meets AI: Towards Autonomous Database Systems

Qingsong Guo
UDBMS Group, 2019.11.22

DBMS is a solved problem?

• Over 40 years of DB research
• We developed
  – fantastic algorithms
  – great systems
  – clever query processing and storage strategies
  – a lot of brilliant stuff …
• DBMSs have become fast
  – when you look at TPC-C: we are currently able to execute ~half a million transactions per second
  – for simple index lookups, e.g., in a hash table, we are currently at 20 million operations per second

DBMS is not solved yet

Take query optimization as an example: DQ (deep reinforcement learning) outperforms the traditional methods.
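The comparison on this slide is reported as mean sub-optimality, the ratio of the cost of the plan an optimizer picks to the cost of the optimal plan, averaged over queries. A minimal sketch of that metric follows; the function names and all cost values are hypothetical and are not results from the talk.

```python
from statistics import mean


def sub_optimality(plan_cost: float, optimal_cost: float) -> float:
    """Cost of the chosen plan divided by the cost of the optimal plan (>= 1 is worse)."""
    if optimal_cost <= 0:
        raise ValueError("optimal_cost must be positive")
    return plan_cost / optimal_cost


def mean_sub_optimality(plan_costs, optimal_costs) -> float:
    """Average sub-optimality over a workload of queries; lower is better."""
    if len(plan_costs) != len(optimal_costs):
        raise ValueError("need one optimal cost per query")
    return mean(sub_optimality(p, o) for p, o in zip(plan_costs, optimal_costs))


if __name__ == "__main__":
    # Hypothetical plan costs for three queries (arbitrary cost units).
    optimal = [100.0, 200.0, 50.0]
    optimizer_a = [100.0, 260.0, 75.0]
    optimizer_b = [110.0, 200.0, 50.0]
    print("optimizer A:", mean_sub_optimality(optimizer_a, optimal))
    print("optimizer B:", mean_sub_optimality(optimizer_b, optimal))
```

A value of 1.0 means every query received the optimal plan, which is why lower scores are better.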

Mean sub-optimality of the queries, i.e., “cost(plan from each algorithm) / cost(optimal plan)”, so lower is better.

Deep Learning to replace humans in the DB-loop

Humans in the DB-loop:
– ETL (extract, transform, & load)
– schema design
– data integration
– physical design
– knob tuning
– index selection
– partitioning and replication
– design of database internals, e.g., query optimizer, index structures, data layouts, and storage engines

What are we doing?

Building an autonomous database system (AutoDB) that is capable of self-learning and automatic management of big (multi-model) data.

● Essential properties
– Everything in an AutoDB is a model (probabilistic model/ML model)
– The database is model-driven
– The administrative processes are automatic
– The database internals are automated

Autonomous database system

About UDBMS group

UDBMS = Unified Database Management Systems

Prof. Jiaheng Lu
[email protected]
C211 Exactum

Postdoc: Qingsong Guo ([email protected], C210 Exactum)
4 PhD students
1 Master student

UDBMS: https://www.helsinki.fi/en/researchgroups/unified-database-management-systems-udbms

Research group: Ubiquitous Interaction

Courses:
Human Computer Interaction
Interactive data visualization
Seminar on Advanced Topics in HCI

Giulio Jacucci, Professor, PhD
Dr Mikko Kytö, Chen He
[email protected]

Data intensive Interaction Techniques: Mind Search

[Figure: Mind Search. Content of interest ranked 1 Atom, 2 Timeline of atomic…, 3 Neutron, 4 Timeline of quantum…, 5 Electron, 6 Timeline of physical…, 7 History of physics, 8 Proton, 9 History of chemistry. Panels: Term Relevance Prediction (relevant / irrelevant), Intent Modeling, Brain Signal Based Retrieval]

Eugster, M. J., Ruotsalo, T., Spapé, M. M., Barral, O., Ravaja, N., Jacucci, G., & Kaski, S. (2016). Natural brain-information interfaces: Recommending information by relevance inferred from human brain signals. Scientific reports, 6, 38580.

Affective annotation of media

Barral, O., et al. (2016). Extracting relevance and affect information from physiological text annotation. User Modeling and User-Adapted Interaction, 26(5), 493-520.

Fig. 4: Experimental task and user interface. The participant selects a news portal of her choice, and browses the news freely. After reading a specific news article, the participant clicks on one of the affective feedback icons (from left to right: “happy”, “sad”, “angry”, and “neutral”). The participant is allowed to provide voluntary feedback, as well as to change a news portal at any point, by entering a new URL in the text box designated for it. Participants read news articles for 45-60 minutes.

Electrodermal activity was used as the physiological signal, as it has been proved to be indicative of arousal and stimulus novelty (Dawson et al., 2007; Boucsein, 2012). In order to further minimize the intrusiveness of the recording, and given the less binding findings for CSA reported in Experiment 1 (see Section 3), in the present experiment we solely relied on EDA.

4.1 Participants

Twenty-four participants (five females) took part in the study, two of which participated also in Experiment 1. Participants ranged from 23 to 36 years old (M = 29.7). Three participants were postdoctoral researchers, and the rest were students (18 postgraduate, and three undergraduate) from the University of Helsinki and Aalto University in Finland. Nine of the participants read news in their native language only, five read the news both in their native and foreign language, and ten only in a foreign language. In total, 15 different mother tongues were reported. Overall, participants reported high engagement with the content they were reading (M = 4.08, on a five-level Likert scale), and to not feel intruded by being asked to provide feedback (M = 1.95, on a five-level Likert scale). Two of the participants were left handed, even though only one of them used the computer mouse with the left hand. Participants reported themselves to be physically and mentally healthy.

Barral, O., Kosunen, I., & Jacucci, G. (2018). No Need to Laugh Out Loud: Predicting Humor Appraisal of Comic Strips Based on Physiological Signals in a Realistic Environment. ACM Transactions on Computer-Human Interaction (TOCHI), 24(6), 40.

User interfaces to data for behavior change

MatkaHupi: A persuasive mobile application for sustainable mobility

Antti Jylhä, Petteri Nurmi, Samuli Hemminki, Miika Sirén, Dinesh Wijekoon, Chao An, Giulio Jacucci

INTRODUCTION
• Aim: sustainable urban mobility by motivating the travelers to use more eco-friendly transport options by personalized mobility challenges
• MatkaHupi: a mobile (Android) application, comprising
  • Journey planner
  • Automatic transport mode detection
  • Challenges and rewards
  • Feedback on CO2 emissions

INTERACTION DESIGN AND FEATURES
• GPS location and automatic trip detection
• Main screen (Fig. 1): overall emissions, breakdown by transport mode, active challenges
• Journey planner based on HSL API
• Trip history with emission annotations
• Challenges and rewards
• Automatic suggestions for new challenges, e.g., better routes and alternative means of transport.
• The user can fine-tune the trip details (mode of transport, line number) after detection if needed.

AUTOMATIC TRANSPORT MODE DETECTION
• Based on sensor fusion (e.g., GPS, accelerometer, Wi-Fi)
• Distinguishes between walking, cycling, and motorized transport
• For trip detection, compares computed values to public transport schedule information (open API by HSL)

CHALLENGES
• Challenges are behavioral goals for the user, either set by the system (e.g., walk for 7 km, try out the tram, reduce weekly emissions) or by the user (trip promises)
• By completing challenges, the user earns badges and points

PILOT STUDY
• 4 week study, 12 subjects (7 full MatkaHupi, 5 without challenges)
• Questionnaires, data logging, and interview
• Challenges were met with favorable feedback
• Personalization is required
• CO2 estimates increase environmental consciousness for some users

FUTURE WORK
• More personalized challenges
• Address power consumption issues of automatic transport mode detection
• Full-scale user study

Figure 1: The main screen of MatkaHupi.
• An indicator shows the current detected status of the user.
• The user can see how the emissions of the current week compare to the past three weeks.
• Middle of the screen shows a breakdown of emissions per transport mode.
• Bottom of the screen shows a summary of the active challenges.
• A journey planner, making use of the open API provided by HSL, is available for planning new trips.
• The user can view a list of past trips.
• A more detailed view of ongoing and completed challenges.

ACKNOWLEDGMENTS
MatkaHupi has been co-funded by the European Commission and EIT ICT Labs and is a joint research effort of the UIx group and the Adaptive Computing group at HIIT.

Gabrielli, S., Forbes, P., Jylhä, A., Wells, S., Sirén, M., Hemminki, S., Nurmi, P., Maimone, R., Masthoff, J., Jacucci, G. (2014). Design Challenges in Motivating Change for Sustainable Urban Mobility. To appear in Computers in Human Behavior.

Jylhä, A., Nurmi, P., Sirén, M., Hemminki, S., & Jacucci, G. (2013). Matkahupi: a persuasive mobile application for sustainable mobility. In Proceedings of the 2013 ACM UBICOM, 227-230.

Interactive Data visualization techniques

Interactive Data Map

Klouche, K., Ruotsalo, T., Micallef, L., Andolina, S., & Jacucci, G. (2017, March). Visual re-ranking for multi-aspect information retrieval. In Proceedings of the 2017 Conference on Human Information Interaction and Retrieval (pp. 57-66). ACM.

Serim, B., & Jacucci, G. (2016, May). Pointing while looking elsewhere: Designing for varying degrees of visual guidance during manual input. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (pp. 5789-5800). ACM.

Visual Search and Interactive Intent modelling

Ruotsalo, T., Jacucci, G., Myllymäki, P., & Kaski, S. (2014). Interactive intent modeling: information discovery beyond search. Commun. ACM 58, 1 (December 2014), 86-92.

Insights generation and sharing in interactive data visualisation

He, C., Micallef, L., Kaski, S., Aittokallio, T., & Jacucci, G. (2017). MediSyn: uncertainty-aware visualization of multiple biomedical datasets to support drug treatment selection. BMC bioinformatics, 18(10), 393.

NEWS AUTOMATION

• How can we automatically, in a generic/adaptive way, ...
  • identify something as newsworthy (~important),
  • organize the information into a coherent narrative, and
  • express the information in natural language?

Leo Leppänen – [email protected]

DISCOVERY RESEARCH GROUP

• Led by Prof. Hannu Toivonen
• Research themes
  • Computational creativity
  • Self-aware and self-adaptive systems
  • Interaction of and with computational, creative agents
  • Natural language analysis
  • Natural language generation

Exploratory Search and Personalisation research group

We do applied and empirical research related to:
• interactive information retrieval
• user modelling
• reinforcement learning

https://glowacka.org

REPAIR: an experimental framework to optimise user experiences

• Calibration on the basis of subjective qualities, e.g. aesthetics
• User study: randomised experiments + binary feedback
• Model preferences with Bayesian interval regression
• Application: neural style transfer (head, bust, body)

[Figure: style weight (axis ticks 8 to 11) for head, bust and body]

Document structure in different scientific domains

• IR for scientific literature is often performed on abstracts
• Abstract representativeness is subfield-specific: more theoretical subfields have less representative abstracts
• Section-wise topic distributions can infer subfield interrelatedness

[Figure: KL divergence (bits), axis range 2.0 to 3.0, per computer science subfield, on a scale from Applied to Theoretical]

Assessing writing quality in Wikipedia with neural language models

• Wikipedia is high quality in terms of factual accuracy, but has poor readability and style
• How to prioritise which pages to edit to improve writing quality?
• Old idea: query likelihood language models from the '90s + neural language models (e.g. GPT-2)

Exploratory Data Analysis Group

http://www.helsinki.fi/exploratory-data-analysis

Associate Professor Kai Puolamäki [email protected]

Institute for Atmospheric and Earth System Research (INAR) Department of Computer Science

Exploratory Data Analysis Group / Kai Puolamäki, 20 November 2019

Kai Puolamäki
Associate Professor (computer science and atmospheric sciences)
Institute for Atmospheric and Earth System Research (INAR)
Department of Computer Science
Vice Director, Helsinki Institute for Information Technology HIIT

Research interests: exploratory data analysis; machine learning; artificial intelligence; statistical robustness in data analysis; data science; analysis of simulated and measurement data, with applications to atmospheric sciences.
Teaching: e.g., “DATA11002 Introduction to machine learning”

[email protected]
http://www.iki.fi/kaip/
Exactum A342
+358 50 5228111

EXPLORATORY DATA ANALYSIS GROUP

Institute for Atmospheric and Earth System Research (INAR)
Department of Computer Science
Group leader: Prof. Kai Puolamäki
Website: http://www.helsinki.fi/exploratory-data-analysis
Research topic: algorithmic and probabilistic methods of artificial intelligence that help expert users, such as physicists, understand large heterogeneous data sources and make analytic data-based decisions. Applications to physical data from large-scale measurements and simulations, especially in atmospheric sciences.

Part of the group in DS 2019, Split, Croatia. From the left: Kai Puolamäki, Moritz Lange, Rafael Savvides, Henri Suominen, Anton Björklund

RESEARCH HIGHLIGHT: SPARSE ROBUST REGRESSION FOR EXPLAINING CLASSIFIERS

• Robust regression can handle outliers
• Novel robust regression algorithm SLISE
  • finds the largest subset of data items represented by a linear model to a given accuracy
• Local approximations using SLISE
  • replace y-axis with outcomes of a complex function and force the regression line to pass through a selected point
• Local explanations using SLISE
  • explanation for a black box AI model

[Figure labels: least squares regression, robust regression, selected point]
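The stated goal of SLISE, finding the largest subset of data items that a linear model represents to a given accuracy, can be illustrated with a toy brute-force search in one dimension. This is only a sketch of the objective under simplifying assumptions: it tries lines through pairs of data points, which is neither the optimisation method of the SLISE paper nor guaranteed to find the best line in general. All names and data are hypothetical.

```python
from itertools import combinations


def inliers(points, slope: float, intercept: float, eps: float):
    """Points whose residual to the line y = slope * x + intercept is at most eps."""
    return [(x, y) for x, y in points if abs(y - (slope * x + intercept)) <= eps]


def largest_subset_line(points, eps: float):
    """Brute-force stand-in for the objective: try the line through every pair
    of points and keep the one that fits the most points within eps."""
    best_slope, best_intercept, best_subset = 0.0, 0.0, []
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical line, not representable as y = a * x + b
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        subset = inliers(points, slope, intercept, eps)
        if len(subset) > len(best_subset):
            best_slope, best_intercept, best_subset = slope, intercept, subset
    return best_slope, best_intercept, best_subset


if __name__ == "__main__":
    # Eight points on y = 2x + 1 plus three gross outliers (toy data).
    data = [(float(x), 2.0 * x + 1.0) for x in range(8)]
    data += [(1.0, 20.0), (4.0, -15.0), (6.0, 40.0)]
    slope, intercept, subset = largest_subset_line(data, eps=0.1)
    print(slope, intercept, len(subset))
```

Unlike least squares, which would be pulled towards the three outliers, this objective ignores any point outside the tolerance, which is the sense in which the slide calls robust regression able to handle outliers.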

Björklund A., Henelius A., Oikarinen E., Kallonen K., Puolamäki K. (2019) Sparse Robust Regression for Explaining Classifiers. In: Discovery Science. DS 2019. LNCS 11828. https://doi.org/10.1007/978-3-030-33778-0_27 [DS 2019 best student paper award]

CURRENTLY WORKING ON: MODELLING ATMOSPHERIC SIMULATIONS

• Large Eddy Simulations (LES) in urban environment
• Terabytes of data, but small n, large p
• Objectives:
  • Replace LES simulators with AI models
    ‒ how to estimate the reliability of the AI models
  • How to understand the data and emergent processes?
  • How to replace computationally expensive LES simulations with efficient tools, e.g., for city planners?
  • ...

[Figure labels: response, model prediction, estimate of reliability]

Moritz Lange, Henri Suominen, Rafael Savvides, Emilia Oikarinen, Kai Puolamäki. Joint work with Leena Järvi, Mona Kurppa, and Sasu Karttunen.

INTERESTED?

• Ask for MSc topics
• Call for summer internships to be announced around January 2020

http://www.helsinki.fi/exploratory-data-analysis

MachQu: Probabilistic Machine Learning for Material Design
Jussi Määttä, Postdoctoral Researcher

Joint work with Jyri Kimari, Viacheslav Bazaliy, Teemu Roos, Flyura Djurabekova, Kai Nordlund

Data Science Fest, 22 Nov 2019

Active Learning

Before: simulator calls oracle

After: simulator calls ML, ML maybe calls oracle
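The "after" call pattern can be sketched as a surrogate that answers a query itself when it is confident and falls back to the oracle otherwise, learning from every oracle answer. This is a minimal illustration, not the MachQu implementation: the oracle is a stand-in function, and the distance to the nearest known input is used as a crude proxy for the uncertainty a probabilistic model would provide. All names and thresholds are assumptions.

```python
import math


def oracle(x: float) -> float:
    """Stand-in for the expensive oracle (hypothetical, e.g. a costly physics code)."""
    return math.sin(x)


class NearestNeighbourSurrogate:
    """Toy surrogate: answers from memory when a known input is close enough,
    otherwise calls the oracle and remembers the result."""

    def __init__(self, max_distance: float):
        self.max_distance = max_distance  # uncertainty threshold (assumed proxy)
        self.xs = []
        self.ys = []
        self.oracle_calls = 0

    def _nearest(self, x: float):
        """Return (distance, stored value) for the closest known input."""
        i = min(range(len(self.xs)), key=lambda j: abs(self.xs[j] - x))
        return abs(self.xs[i] - x), self.ys[i]

    def predict(self, x: float) -> float:
        if self.xs:
            distance, y = self._nearest(x)
            if distance <= self.max_distance:
                return y  # confident: the ML model answers on its own
        # Uncertain: ask the oracle and add the answer to the training data.
        y = oracle(x)
        self.oracle_calls += 1
        self.xs.append(x)
        self.ys.append(y)
        return y


if __name__ == "__main__":
    model = NearestNeighbourSurrogate(max_distance=0.1)
    for query in [0.0, 0.05, 1.0, 1.02, 0.08]:
        print(query, model.predict(query))
    print("oracle calls:", model.oracle_calls)
```

The threshold controls the trade-off: a larger value saves oracle calls at the price of coarser answers, which is why a reliable uncertainty estimate matters for this pattern.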

Uncertainty quantification!

Interpretable ML

Train a large ML model: atomic configuration → physical quantities

Construct a small, interpretable approximation of the model

Language Technology Helsinki - NLP
http://blogs.helsinki.fi/language-technology/

Jörg Tiedemann
Department of Digital Humanities
University of Helsinki
[email protected]

[Logo: HELSINKI Language Technology. Slide keywords: meaning, speaking, understanding the World’s languages]

Natural Language Understanding

[Diagram: meaning representations (How? Why? What?) at the centre, surrounded by these areas]
• machine translation
• natural language inference & reasoning
• paraphrasing
• language learning
• error correction
• sentiment & emotion detection
• conversational / interactive AI
• creative language generation
• hate speech / fake news detection
• data / knowledge mining
• text classification
• managing audiovisual data
• digital humanities & linguistics

Language Technology and Deep Learning

[Diagram: meaning representations (How? Why? What?) at the centre, surrounded by these sources]
• learn from logical inference examples
• learn from translated data
• learn from annotated data
• learn from raw data
• learn from multimodal content

What you can do in NLP

• enroll in our NLP courses
• collaborate with us in a master thesis
• work in our research projects

Courses we offer:
• https://blogs.helsinki.fi/language-technology/study-info/
• https://blogs.helsinki.fi/language-technology/study/courses-2019-2020/

Projects we run:
• https://blogs.helsinki.fi/language-technology/hi-nlp/crosslingual/
• https://blogs.helsinki.fi/language-technology/resources/
• https://blogs.helsinki.fi/language-technology/project-ideas/

THESIS TOPICS

PROFESSOR SASU TARKOMA

Example: Carat Dataset August 2018

● Originated in UC Berkeley, in collaboration with University of Helsinki
● Mobile app for Android and iOS
● Currently over 850 000 users
● >2.5 TB of data, >250 million measurements
● Research project with many directions
● Publications and open datasets (Open Source clients): http://carat.cs.helsinki.fi

● Carat is the first system to use the device community to detect and correct energy problems
● Our method for diagnosing energy anomalies uses the community to infer a specification (expected energy use), and we call deviation from that inferred specification an anomaly
● Many awards including Mark Weiser Best Paper Award 2015.
● A. J. Oliner, A. P. Iyer, I. Stoica, E. Lagerspetz, S. Tarkoma. Carat: Collaborative energy diagnosis for mobile devices. In Proceedings of ACM SenSys ‘13.

WHAT IS MEGASENSE?

5G REAL-TIME MASSIVE-SCALE ENVIRONMENTAL SENSING WITH IOT AND AI

• Multidisciplinary research program
• Scalable and intelligent real-time air pollution monitoring solutions
• Hierarchical architecture with new low-cost sensors
• Leverages low-cost air pollution sensors, ML/AI, and versatile connectivity provided by 4G/5G

Nokia @ Espoo
[email protected]

Innovative and International HQ in Espoo

~3 000 employees
50+ nationalities

• Nokia headquarters, all business units
• Most sustainable city in EU
• Strong ecosystem
• Most innovative community in the world

Main Technologies: Mobile Networks (5G), Cloud Computing, AI/ML, IoT, Digital Service Provider Software, Drones…

+ Nokia Bell Labs

• Business process automation
• Improved data- and model-driven decisions
• Algorithm research
• AI/ML in products and services

All Nokia Internship and Thesis Worker jobs are advertised on LinkedIn Jobs: https://www.linkedin.com/jobs/
Finland Student Opportunities: https://careers.nokia.com/page/finland-student-opportunities-347

For summer 2020, the openings are posted in Jan-Feb 2020, but there are also jobs open “all the time”.