DATA SCIENCE RESEARCH
Data Science Fest 22.11.2019
Faculty of Science Data Science Fest 22.11.2019 1 TODAY’S PROGRAMME
1. Jukka K. Nurminen: Software Systems for Data-Intensive Computing
2. Keijo Heljanko: Challenges in Parallel and Distributed Data Science
3. Tong Li: Long-term Mobile App Usage
4. Ghazaleh Kia: Spatiotemporal Data Analysis for Sustainability Science
5. Jussi Kangasharju: Collaborative Networking
6. Indrė Žliobaitė: Data Science and Evolution
7. Antti Koskela: Probabilistic Inference, Privacy and Computational Biology
8. Qingsong Guo: DB meets AI - Towards Autonomous Database Systems
9. Giulio Jacucci: Ubiquitous Interaction
10. Leo Leppänen: News Automation
11. Dorota Glowacka / Alan Medlar: Exploratory Search and Personalisation
12. Kai Puolamäki: Exploratory Data Analysis
13. Jussi Määttä: MachQu - Probabilistic Machine Learning for Material Design
14. Jörg Tiedemann: Language technology / NLP
15. Antti Ukkonen: Machine learning, language, and online conversations
16. Sasu Tarkoma: Content-centric structures and networking
17. Timo Tenhovuori: Data Science in Nokia
Jukka K Nurminen
Professor of Software Systems for Data-Intensive Computing (in Natural Sciences)
• My research focus is on the engineering and operation of AI systems
• Testing, maintenance, and other software lifecycle steps for AI systems
• Sustainable machine learning platforms
‒ Energy-efficiency
‒ New opportunities, especially quantum computing
• Methods and tools to make AI ethics useful for developers
• For me computer science is about solving problems
• New and relevant challenges are always welcome
• Experience in industrial research (Nokia), applied research (VTT), and academic research (Aalto)
Email: [email protected]
Publications: https://scholar.google.fi/citations?user=cmftgn4AAAAJ
LinkedIn: https://www.linkedin.com/in/jukka-k-nurminen/
Department of Computer Science Jukka K Nurminen 21.11.2019
WHO IS ELAINE HERZBERG?
Software Life-Cycle Costs - Schach 2002
HELSINGIN YLIOPISTO HELSINGFORS UNIVERSITET UNIVERSITY OF HELSINKI
Challenges in Parallel and Distributed Data Science
Keijo Heljanko ‹keijo.heljanko@helsinki.fi› Nov 22, 2019
University of Helsinki
Department of Computer Science
Helsinki Center for Data Science (HiDATA)
Big Data
• A Seagate-sponsored IDC study estimates the Global Datasphere to be 33 Zettabytes (33 000 000 000 TB) in 2018 and to grow to 175 Zettabytes by 2025
• Data comes from: video, sensor data, Internet sites, social media, AI applications, healthcare, . . .
• For example, Netflix is collecting 1 PB of data per month from its video service user behaviour; total data warehouse 60+ PB
• According to Cisco, Internet traffic volume is growing 30% per year, while Big Data storage is growing 51% per year
End of 40 Years of Sequential Computing Improvements
• The 40 years of general purpose sequential computing performance increase has slowed to 3% per year [Hennessy & Patterson '18]
Implications of the End of Free Lunch
• Data volumes and application requirements are growing much faster than microprocessor performance
• The number of transistors and cores in a microprocessor is still growing at a high rate
• Programming models need to change to efficiently exploit all the available parallelism
• Programs should be made more parallel every year to be able to cope with data volume growth
• The future challenge for parallel and distributed systems is to come up with programming models that scale to the massively growing data volumes
Research Topics in 2019
• Distribution and Parallelization: Pan-Genomics pipelines with Apache Spark
• Applied Machine Learning: Prediction of Wet-end Breaks in a Paper Machine, Prediction of events in business processes
• Automated Verification: Concurrency bug finding tool for Linux kernel developers
• Design of Distributed Systems: Kubernetes based architecture for IoT Edge analytics
• Theory of Distributed Systems: Tool for proving parameterized systems correct
"What Apps Did You Use?": Understanding the Long-term Evolution of Mobile App Usage
Tong LI
22-11-2019
Long-term Mobile App Usage Dataset
# of Users: 1,465
# of Records: 12,457,867
# of Apps: 110,932
# of App Categories: 32
Attributes: User ID, apps, time zone, timestamp, mobile network type
Date: 01/2012 - 12/2017
Area: Worldwide (over 80 countries)
The evolution of app-category usage
Development of mobile networks
Correlation of app categories
The evolution of app usage
Pareto effect
Correlation of apps in the ‘News and magazines’ category across different years.
Correlation of apps in the ‘Social’ category across different years.
Summary
The long-term usage of app categories and of individual apps evolves differently. The usage of an app category passes through two stages: a growth stage and a plateau stage. Individual apps pass through one additional stage: an elimination stage.
The development of technologies triggers the growth stage in both app categories and apps. This increasing trend is not influenced by the maturity of app categories or by the Pareto effect.
Fierce competition among apps within a category leads to the elimination stage of app usage and to a decrease in the correlations between apps in the same category.
SPATIOTEMPORAL DATA ANALYSIS FOR SUSTAINABILITY SCIENCE
Laura Ruotsalainen, Associate Professor
Ghazaleh Kia, PhD Student
Titti Malmivirta, Research Assistant
Sergey Nikolskiy, co-supervised PhD Student
18/11/2019
GROUP’S RESEARCH FOCUS
• Navigation in challenging environments (urban, indoors, Arctic)
• Autonomous systems (Road transport, UAVs)
• Computer vision methods for navigation and situational awareness
• Mitigation of intentional interference of satellite positioning
NAVIGATION IN CHALLENGING SITUATIONS
• Statistical Error Modelling
• Recursive Bayesian Estimation
• Measurement fusion
• Cooperative positioning
• Machine Learning
• Recognizing environment and motion
• Route prediction
• Improving computer vision / radio signal processing
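As an illustration of the recursive Bayesian estimation and measurement fusion items above, here is a minimal sketch of a one-dimensional Kalman filter. It is not taken from the group's work: the scalar state, the class `Gaussian`, the functions `predict`, `update` and `fuse`, and all numbers are assumptions chosen only to show the predict/update recursion.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    """Belief about a scalar state, e.g. position along a street (hypothetical)."""
    mean: float
    var: float


def predict(belief: Gaussian, motion: float, motion_var: float) -> Gaussian:
    # Prediction step: shift the belief by the motion and grow its uncertainty.
    return Gaussian(belief.mean + motion, belief.var + motion_var)


def update(belief: Gaussian, z: float, meas_var: float) -> Gaussian:
    # Update step: Bayes' rule for a Gaussian prior and a Gaussian measurement.
    gain = belief.var / (belief.var + meas_var)
    return Gaussian(belief.mean + gain * (z - belief.mean),
                    (1.0 - gain) * belief.var)


def fuse(belief: Gaussian, measurements) -> Gaussian:
    # Measurement fusion: apply the update once per (value, variance) pair.
    for z, meas_var in measurements:
        belief = update(belief, z, meas_var)
    return belief


prior = Gaussian(mean=0.0, var=100.0)            # vague initial belief
belief = predict(prior, motion=1.0, motion_var=1.0)
# Two made-up sensors: a noisy one (variance 4.0) and a better one (variance 1.0).
posterior = fuse(belief, [(1.4, 4.0), (0.9, 1.0)])
print(posterior)
```

The fused variance ends up smaller than that of the best single sensor, which is the reason for combining measurements in the first place.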
AUTONOMOUS TRAFFIC
• Traffic deaths: 1.35 M per year globally, 10 M injured or disabled
• 93 % of accidents caused by or contributed to by driver error
• Greenhouse gas emissions
• Drones: 20 000 over city / hour / 2035 (EU)
• Pedestrians / Bicycles
• Indoor Navigation
60˚ 10′ 1.2″ N, 24˚ 57′ 18″ E
Collaborative Networking
Prof. Jussi Kangasharju
University of Helsinki
Research Focus Areas
• Edge and cloud computing
• Combining AI and networking
• Information-centric networking (ICN)
• Internet of Things (IoT)
• Green networking
Edge Computing
• Cloud computing centralizes all data and computation
• Increased data amounts (big data, IoT) make this inefficient
• Too much network traffic and processing load at central cloud
• A lot of data is only of local interest -> No benefit in sending it to the central cloud
• Edge and fog computing move processing towards the edge of the network
• Lower latencies for information access and processing
• Autonomous driving, AR/VR, etc.
• Less wide-area network traffic
• Easier to ensure privacy of user data due to limited scope of usage
• Edge computing is a key feature of 5G
• Our current research topics:
• Discovery of edge servers
• Edge server placement
• Flexible computations
Intelligent Containers (ICON)
• Autonomous computing entities
• Providing services to users
• ICON swarm observes environment
• Migrate or replicate closer to users
• Application owner tunes behavior with latency and budget knobs
Information-Centric Networking (ICN)
• ICN evolves the Internet from a host-centric paradigm to an information-centric view of the network (named information)
• Supports intermittent connectivity, user mobility, multicast
• Key features: Name-based routing, universal caching, built-in security
• Our current research topics:
• How to discover cached content in a distributed environment?
• How to handle mobility of content producers and consumers?
• Using ICN for managing distributed data and computation in IoT
Contact Information
• Website: http://www.helsinki.fi/collaborative-networking
• Email: [email protected]
• Twitter: @kangasharju
Group leader: Indrė Žliobaitė, Assistant professor, [email protected]
Analyzing changing world
Analyzing changes in nature and society
Computational methods for evolving data, change detection
Analyzing and interpreting the global fossil record
Transparency and accountability in machine learning
Art: Mauricio Anton, Ika Osterblad
Concept drift
Model does not change
Process changes
Source: Evonik Industries
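The concept drift situation named on this slide (the model stays fixed while the process changes) can be sketched in a few lines. This is only an illustration under stated assumptions, not the group's method: the detector compares the error rate in a sliding window with the error rate in an initial reference window, and the names `detect_drift`, `fixed_model` and `process`, the thresholds 50 and 80, and the change point 300 are all made up.

```python
def detect_drift(errors, window=100, threshold=0.2):
    """Return the first index at which the error rate over the last `window`
    items exceeds the reference error rate by more than `threshold`, else None."""
    reference = sum(errors[:window]) / window
    for end in range(2 * window, len(errors) + 1):
        recent = sum(errors[end - window:end]) / window
        if recent - reference > threshold:
            return end - 1
    return None


def fixed_model(x):
    # The deployed model: learned once, never updated.
    return x > 50


def process(x, t, change_point=300):
    # The real process: its decision boundary moves at `change_point`.
    boundary = 50 if t < change_point else 80
    return x > boundary


# Deterministic toy readings that cycle through all values 0..99.
readings = [(t * 37) % 100 for t in range(600)]
errors = [int(fixed_model(x) != process(x, t)) for t, x in enumerate(readings)]
drift_at = detect_drift(errors)
print("drift flagged at index", drift_at)
```

In this toy stream the fixed model makes no errors before the change and is wrong on roughly 30 % of items afterwards, so the window-based check raises a flag some time after index 300.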
Source: https://www.google.com/url?sa=i&source=images&cd=&cad=rja&uact=8&ved=2ahUKEwj5w6LqycrhAhWr-yoKHaydCqEQjRx6BAgBEAU&url=https%3A%2F%2Fwww.genetec.com%2Fsolutions%2Fall-products%2Ftraffic-sense&psig=AOvVaw3idS_4zuwXcMJparMR1LXH&ust=1555159241460997
Evolutionary palaeontology
Analyzing the global fossil record
Looking at changes in the past (hopefully) helps to understand what's coming in the future and how new emerging ecosystems will work
Climate change, societal change and sustainability
novel ecosystems
Photo credit: Markku Ulander, Kayle Reed
How to evaluate predictive models that adapt?
Do species age?
What does evolution optimize for?
How do we make robust and reliable models?
How do faunal communities rise and fall?
Joint work with many interdisciplinary collaborators
Funding
PROBABILISTIC INFERENCE, PRIVACY AND COMPUTATIONAL BIOLOGY
Group leader: Antti Honkela, associate professor
Department of Computer Science Finnish Center for Artificial Intelligence FCAI
Faculty of Science
RESEARCH AND TEACHING TOPICS
• Research:
‒ Efficient methods for Bayesian inference
‒ Differentially private machine learning and Bayesian inference
‒ Genome sequencing data analysis and modelling
• Teaching:
‒ Computational Statistics, period I, every year
‒ Trustworthy Machine Learning (with I. Žliobaitė), period II, every year
‒ Seminar: Machine Learning with Distributed Data, spring 2020
Privacy-preserving machine learning with differential privacy (DP)
• Results on a data set should not change too much, even if one person's data changes
Differential privacy (Dwork et al., 2006)
Definition: An algorithm $\mathcal{M}$ operating on a data set $\mathcal{D}$ is said to be $\epsilon$-differentially private ($\epsilon$-DP) if for any two data sets $\mathcal{D}$ and $\mathcal{D}'$, differing only by one sample, the probabilities of obtaining a result in any set $S$ fulfil
$$\frac{\Pr(\mathcal{M}(\mathcal{D}) \in S)}{\Pr(\mathcal{M}(\mathcal{D}') \in S)} \le e^{\epsilon}.$$
• Provides protection against adversaries with side information
• Is invariant to post-processing
• Degrades gracefully under composition
OUR WORK IN DIFFERENTIALLY PRIVATE ML
• Algorithms for DP machine learning and Bayesian inference • DP learning with distributed data • DP data anonymisation • Applications, e.g. drug sensitivity prediction
Figure: reproducing statistical discoveries from strongly anonymised data
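To make the ε-differential privacy definition of Dwork et al. (2006) concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. The Laplace mechanism is a standard textbook construction and is not claimed to be one of the group's algorithms. The function names, the toy `ages` data and the value of `epsilon` are assumptions for illustration only.

```python
import math
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_density(x, mu, scale):
    # Density of the released value when the true count is `mu`.
    return math.exp(-abs(x - mu) / scale) / (2.0 * scale)


def private_count(data, predicate, epsilon, rng):
    true_count = sum(1 for row in data if predicate(row))
    sensitivity = 1.0  # changing one sample changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 38]        # made-up toy data set D (true count 4)
neighbour = [23, 35, 41, 29, 52, 27]   # D' differs in one sample (true count 3)
epsilon = 0.5
noisy = private_count(ages, lambda a: a > 30, epsilon, rng)

# For any output x, the density ratio between D and D' stays within e^epsilon.
for x in [-3.0, 3.0, 3.5, 4.0, 10.0]:
    ratio = laplace_density(x, 4, 1.0 / epsilon) / laplace_density(x, 3, 1.0 / epsilon)
    print(f"x={x:5.1f}  ratio={ratio:.3f}  bound={math.exp(epsilon):.3f}")
```

A smaller `epsilon` means a larger noise scale and a tighter bound on how much one person's data can shift the output distribution.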
DB meets AI
Towards Autonomous Database Systems
Qingsong Guo
UDBMS Group, 2019.11.22
DBMS is a solved problem?
• Over 40 years of DB research
• We developed
– fantastic algorithms
– great systems
– clever query processing and storage strategies
– a lot of brilliant stuff …
• DBMSs have become fast
– when you look at TPC-C: we are currently able to execute ~half a million transactions per second
– for simple index lookups, e.g., in a hash table, we are currently at 20 million operations per second
DBMS is not solved yet
Take query optimization as an example: DQ (deep reinforcement learning) outperforms the traditional methods.
The metric is the mean sub-optimality of the queries, i.e., “cost(plan from each algorithm) / cost(optimal plan)”, so lower is better.
Deep Learning to replace humans in the DB-loop
Humans in the DB-loop:
– ETL (extract, transform, & load)
– schema design
– data integration
– physical design
– knob tuning
– index selection
– partitioning and replication
– designing of database internals, e.g., query optimizer, index structures, data layouts, and storage engines.
What are we doing?
Building an autonomous database system (AutoDB) that is capable of self-learning and automatic management of big (multi-model) data.
● Essential properties
– Everything in an AutoDB is a model (probabilistic model/ML model)
– The database is model-driven
– The administrative processes are automatic
– The database internals are automated
Autonomous database system
About UDBMS group
UDBMS = Unified Database Management Systems
Prof. Jiaheng Lu [email protected] C211 Exactum
Postdoc: Qingsong Guo ([email protected], C210 Exactum)
4 PhD students
1 Master student
UDBMS: https://www.helsinki.fi/en/researchgroups/unified-database-management-systems-udbms
Research group: Ubiquitous Interaction
Courses:
• Human Computer Interaction
• Interactive data visualization
• Seminar on Advanced Topics in HCI
Giulio Jacucci, Professor, PhD
Dr Mikko Kytö, Chen He
[email protected]
Data intensive Interaction Techniques: Mind Search
Content of Interest
1 Atom
2 Timeline of atomic…
3 Neutron
4 Timeline of quantum…
5 Electron
6 Timeline of physical…
7 History of physics
8 Proton
9 History of chemistry
Relevant Irrelevant Intent Modeling Brain Signal Term Relevance Prediction Based Retrieval
Eugster, M. J., Ruotsalo, T., Spapé, M. M., Barral, O., Ravaja, N., Jacucci, G., & Kaski, S. (2016). Natural brain-information interfaces: Recommending information by relevance inferred from human brain signals. Scientific reports, 6, 38580.
Affective annotation of media
Barral, O., et al. (2016). Extracting relevance and affect information from physiological text annotation. User Modeling and User-Adapted Interaction, 26(5), 493-520.
Fig. 4: Experimental task and user interface. The participant selects a news portal of her choice, and browses the news freely. After reading a specific news article, the participant clicks on one of the affective feedback icons (from left to right: “happy”, “sad”, “angry”, and “neutral”). The participant is allowed to provide voluntary feedback, as well as to change a news portal at any point, by entering a new URL in the text box designated for it. Participants read news articles for 45-60 minutes.
Electrodermal activity was used as the physiological signal, as it has been proved to be indicative of arousal and stimulus novelty (Dawson et al., 2007; Boucsein, 2012). In order to further minimize the intrusiveness of the recording, and given the less binding findings for CSA reported in Experiment 1 (see Section 3), in the present experiment we solely relied on EDA.
4.1 Participants
Twenty-four participants (five females) took part in the study, two of which participated also in Experiment 1. Participants ranged from 23 to 36 years old (M = 29.7). Three participants were postdoctoral researchers, and the rest were students (18 postgraduate, and three undergraduate) from the University of Helsinki and Aalto University in Finland. Nine of the participants read news in their native language only, five read the news both in their native and foreign language, and ten only in a foreign language. In total, 15 different mother tongues were reported. Overall, participants reported high engagement with the content they were reading (M = 4.08, on a five-level Likert scale), and to not feel intruded by being asked to provide feedback (M = 1.95, on a five-level Likert scale). Two of the participants were left handed, even though only one of them used the computer mouse with the left hand. Participants reported themselves to be physically and mentally healthy.
Barral, O., Kosunen, I., & Jacucci, G. (2018). No Need to Laugh Out Loud: Predicting Humor Appraisal of Comic Strips Based on Physiological Signals in a Realistic Environment. ACM Transactions on Computer-Human Interaction (TOCHI), 24(6), 40.
User interfaces to data for behavior change
MatkaHupi – A persuasive mobile application for sustainable mobility
Antti Jylhä, Petteri Nurmi, Samuli Hemminki, Miika Sirén, Dinesh Wijekoon, Chao An, Giulio Jacucci
INTRODUCTION
• Aim: sustainable urban mobility by motivating the travelers to use more eco-friendly transport options by personalized mobility challenges
• MatkaHupi: a mobile (Android) application, comprising
‒ Journey planner
‒ Automatic transport mode detection
‒ Challenges and rewards
‒ Feedback on CO2 emissions
AUTOMATIC TRANSPORT MODE DETECTION
• Based on sensor fusion (e.g., GPS, accelerometer, Wi-Fi)
• Distinguishes between walking, cycling, and motorized transport
• For trip detection, compares computed values to public transport schedule information (open API by HSL)
INTERACTION DESIGN AND FEATURES
• GPS location and automatic trip detection
• Main screen (Fig. 1): overall emissions, breakdown by transport mode, active challenges
• Journey planner based on HSL API
• Trip history with emission annotations
• Challenges and rewards
• Automatic suggestions for new challenges, e.g., better routes and alternative means of transport.
• The user can fine-tune the trip details (mode of transport, line number) after detection if needed.
CHALLENGES
• Challenges are behavioral goals for the user, either set by the system (e.g., walk for 7 km, try out the tram, reduce weekly emissions) or by the user (trip promises)
• By completing challenges, the user earns badges and points
PILOT STUDY
• 4 week study, 12 subjects (7 full MatkaHupi, 5 without challenges)
• Questionnaires, data logging, and interview
• Challenges were met with favorable feedback
• Personalization is required
• CO2 estimates increase environmental consciousness for some users
FUTURE WORK
• More personalized challenges
• Address power consumption issues of automatic transport mode detection
• Full-scale user study
Figure 1: The main screen of MatkaHupi.
• An indicator shows the current detected status of the user.
• The user can see how the emissions of the current week compare to the past three weeks.
• Middle of the screen shows a breakdown of emissions per transport mode.
• Bottom of the screen shows a summary of the active challenges.
• A journey planner, making use of the open API provided by HSL, is available for planning new trips.
• The user can view a list of past trips.
• A more detailed view of ongoing and completed challenges.
ACKNOWLEDGMENTS
MatkaHupi has been co-funded by the European Commission and EIT ICT Labs and is a joint research effort of the UIx group and the Adaptive Computing group at HIIT.
Gabrielli, S., Forbes, P., Jylhä, A., Wells, S., Sirén, M., Hemminki, S., Nurmi, P., Maimone, R., Masthoff, J., Jacucci, G. (2014). Design Challenges in Motivating Change for Sustainable Urban Mobility. To appear in Computers in Human Behavior.
Jylhä, A., Nurmi, P., Sirén, M., Hemminki, S., & Jacucci, G. (2013). Matkahupi: a persuasive mobile application for sustainable mobility. In Proceedings of the 2013 ACM UBICOM, 227-230.
Interactive Data visualization techniques
Interactive Data Map
Klouche, K., Ruotsalo, T., Micallef, L., Andolina, S., & Jacucci, G. (2017, March). Visual re-ranking for multi-aspect information retrieval. In Proceedings of the 2017 Conference on Human Information Interaction and Retrieval (pp. 57-66). ACM.
Pointing while looking elsewhere: Designing for varying degrees of visual guidance during manual input.
Serim, B., & Jacucci, G. (2016, May). In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (pp. 5789-5800). ACM.
Visual Search and Interactive Intent modelling
Tuukka Ruotsalo, Giulio Jacucci, Petri Myllymäki, and Samuel Kaski. 2014. Interactive intent modeling: information discovery beyond search. Commun. ACM 58, 1 (December 2014), 86-92.
Insights generation and sharing in interactive data visualisation
He, C., Micallef, L., Kaski, S., Aittokallio, T., & Jacucci, G. (2017). MediSyn: uncertainty-aware visualization of multiple biomedical datasets to support drug treatment selection. BMC bioinformatics, 18(10), 393.
NEWS AUTOMATION
• How can we automatically, in a generic/adaptive way, ...
‒ identify something as newsworthy (~important),
‒ organize the information into a coherent narrative, and
‒ express the information in natural language?
Leo Leppänen – [email protected] Data Science Fest. 22/11/2019
DISCOVERY RESEARCH GROUP
• Led by Prof. Hannu Toivonen
• Research themes
‒ Computational creativity
‒ Self-aware and self-adaptive systems
‒ Interaction of and with computational, creative agents
‒ Natural language analysis
‒ Natural language generation
Exploratory Search and Personalisation research group
We do applied and empirical research related to: • interactive information retrieval • user modelling • reinforcement learning
https://glowacka.org
REPAIR: an experimental framework to optimise user experiences
• Calibration on the basis of subjective qualities, e.g. aesthetics
• User study: randomised experiments + binary feedback
• Model preferences with Bayesian interval regression
• Application: neural style transfer (head, bust, body)
[Figure: style weight (axis from 8 to 11) for head, bust and body images]
Document structure in different scientific domains
• IR for scientific literature is often performed on abstracts
• Abstract representativeness is subfield-specific: more theoretical subfields have less representative abstracts
• Section-wise topic distributions can infer subfield interrelatedness
[Figure: KL divergence (bits, axis from 2.0 to 3.0) per computer science subfield, on a scale labelled from "Applied" to "Theoretical"]
Assessing writing quality in Wikipedia with neural language models
• Wikipedia is high quality in terms of factual accuracy - but has poor readability and style
• How to prioritise which pages to edit to improve writing quality?
• Old idea: query likelihood language models from '90s + neural language models (e.g. GPT-2)
Exploratory Data Analysis Group
http://www.helsinki.fi/exploratory-data-analysis
Associate Professor Kai Puolamäki [email protected]
Institute for Atmospheric and Earth System Research (INAR) Department of Computer Science
Exploratory Data Analysis Group / Kai Puolamäki 20 November 2019
Kai Puolamäki
Associate Professor (computer science and atmospheric sciences)
Institute for Atmospheric and Earth System Research (INAR)
Department of Computer Science
Vice Director, Helsinki Institute for Information Technology HIIT
Research interests: exploratory data analysis; machine learning; artificial intelligence; statistical robustness in data analysis; data science; analysis of simulated and measurement data, with applications to atmospheric sciences.
Teaching: e.g., “DATA11002 Introduction to machine learning”
[email protected]
http://www.iki.fi/kaip/
Exactum A342
+358 50 5228111
EXPLORATORY DATA ANALYSIS GROUP
Institute for Atmospheric and Earth System Research (INAR)
Department of Computer Science
Group leader: Prof. Kai Puolamäki
Website: http://www.helsinki.fi/exploratory-data-analysis
Research topic: algorithmic and probabilistic methods of artificial intelligence that help expert users – such as physicists – understand large heterogeneous data sources and make analytic data-based decisions. Applications to physical data from large-scale measurements and simulations, especially in atmospheric sciences.
Part of the group in DS 2019, Split, Croatia. From the left: Kai Puolamäki, Moritz Lange, Rafael Savvides, Henri Suominen, Anton Björklund
RESEARCH HIGHLIGHT: SPARSE ROBUST REGRESSION FOR EXPLAINING CLASSIFIERS
[Figure: least squares regression vs. robust regression]
• Robust regression can handle outliers
• Novel robust regression algorithm SLISE
‒ finds the largest subset of data items represented by a linear model to a given accuracy
• Local approximations using SLISE
‒ replace y-axis with outcomes of a complex function and force regression line to pass through a selected point
• Local explanations using SLISE
‒ explanation for a black box AI model
Björklund A., Henelius A., Oikarinen E., Kallonen K., Puolamäki K. (2019) Sparse Robust Regression for Explaining Classifiers. In: Discovery Science. DS 2019. LNCS 11828. https://doi.org/10.1007/978-3-030-33778-0_27 [DS 2019 best student paper award]
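The idea stated on the slide, finding the largest subset of data items that a linear model represents to a given accuracy, can be illustrated with a brute-force stand-in. This sketch is not the SLISE algorithm of the cited paper, which may work quite differently. It simply tries every line through two data points, keeps the largest set of items within `eps` of a line, and refits least squares on that set. The function names and the toy data are assumptions.

```python
from itertools import combinations


def largest_consistent_subset(xs, ys, eps):
    """Indices of the largest subset found whose items lie within `eps`
    of some line through two of the data points (brute force, O(n^3))."""
    best = []
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # vertical line, skip
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        inliers = [k for k in range(len(xs))
                   if abs(ys[k] - (slope * xs[k] + intercept)) <= eps]
        if len(inliers) > len(best):
            best = inliers
    return best


def least_squares(xs, ys):
    # Ordinary least squares for a single feature.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Toy data: y = 2x + 1 with small deterministic noise and two gross outliers.
xs = list(range(10))
ys = [2 * x + 1 + (0.1 if x % 2 == 0 else -0.1) for x in xs]
ys[3], ys[7] = 30.0, -10.0

subset = largest_consistent_subset(xs, ys, eps=0.5)
robust = least_squares([xs[k] for k in subset], [ys[k] for k in subset])
plain = least_squares(xs, ys)
print("subset:", subset)
print("robust fit (slope, intercept):", robust)
print("plain least squares:", plain)
```

The robust fit ignores the two outliers, while plain least squares on all ten points is pulled away from the underlying line.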
CURRENTLY WORKING ON: MODELLING ATMOSPHERIC SIMULATIONS
• Large Eddy Simulations (LES) in urban environment
• Terabytes of data, but small n, large p
• Objectives:
• Replace LES simulators with AI models
‒ how to estimate the reliability of the AI models
• How to understand the data and emergent processes?
• How to replace computationally expensive LES simulations with efficient tools, e.g., for city planners?
• ...
[Figure labels: response, model prediction, estimate of reliability]
Moritz Lange, Henri Suominen, Rafael Savvides, Emilia Oikarinen, Kai Puolamäki. Joint work with Leena Järvi, Mona Kurppa, and Sasu Karttunen.
INTERESTED?
• Ask for MSc topics • Call for summer internships to be announced around January 2020
http://www.helsinki.fi/exploratory-data-analysis
MachQu: Probabilistic Machine Learning for Material Design
Jussi Määttä, Postdoctoral Researcher
Joint work with Jyri Kimari, Viacheslav Bazaliy, Teemu Roos, Flyura Djurabekova, Kai Nordlund
Data Science Fest, 22 Nov 2019
Active Learning
Before: simulator calls oracle
After: simulator calls ML, ML maybe calls oracle
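The "simulator calls ML, ML maybe calls oracle" pattern can be sketched as follows. This is a toy illustration under loud assumptions, not the MachQu implementation: the oracle is a stand-in function, the surrogate is a nearest-neighbour lookup, and the uncertainty measure is simply the distance to the nearest point already labelled by the oracle. All names and numbers are hypothetical.

```python
import math


def oracle(x):
    """Stand-in for an expensive, accurate computation (hypothetical)."""
    return math.sin(x)


class Surrogate:
    """Nearest-neighbour surrogate that falls back to the oracle when unsure."""

    def __init__(self, oracle_fn, max_distance):
        self.oracle_fn = oracle_fn
        self.max_distance = max_distance  # uncertainty tolerance (assumed proxy)
        self.known = {}                   # inputs already labelled by the oracle
        self.oracle_calls = 0

    def uncertainty(self, x):
        # Crude uncertainty estimate: distance to the nearest labelled input.
        if not self.known:
            return math.inf
        return min(abs(x - k) for k in self.known)

    def query(self, x):
        if self.uncertainty(x) > self.max_distance:
            # Too uncertain: ask the oracle and remember the answer.
            self.oracle_calls += 1
            self.known[x] = self.oracle_fn(x)
            return self.known[x]
        nearest = min(self.known, key=lambda k: abs(x - k))
        return self.known[nearest]


model = Surrogate(oracle, max_distance=0.05)
queries = [i * 0.01 for i in range(300)]          # the simulator's requests
answers = [model.query(x) for x in queries]
max_error = max(abs(a - math.sin(x)) for a, x in zip(answers, queries))
print(model.oracle_calls, "oracle calls for", len(queries), "queries")
print("largest error:", round(max_error, 3))
```

The tolerance `max_distance` trades oracle calls against accuracy, which is why the quality of the uncertainty estimate matters so much in this setting.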
Uncertainty quantification!
Interpretable ML
Train a large ML model: atomic configuration → physical quantities
Construct a small, interpretable approximation of the model
Language Technology Helsinki - NLP
http://blogs.helsinki.fi/language-technology/
Jörg Tiedemann
Department of Digital Humanities
University of Helsinki
[email protected]
[Logo: HELSINKI Language Technology; meaning, speaking, understanding the World’s languages]
Natural Language Understanding
[Diagram: "meaning representations" (How? Why? What?) at the centre, surrounded by: machine translation; natural language inference & reasoning; paraphrasing; language learning; error correction; sentiment & emotion detection; conversational / interactive AI; creative language generation; hate speech / fake news detection; data / knowledge mining; text classification; managing audiovisual data; digital humanities & linguistics]
Language Technology and Deep Learning
[Diagram: "meaning representations" (How? Why? What?) at the centre, surrounded by: learn from logical inference; learn from examples; learn from translated data; learn from annotated data; learn from raw data; learn from multimodal content]
What you can do in NLP
• enroll in our NLP courses • collaborate with us in a master thesis • work in our research projects
Courses we offer:
• https://blogs.helsinki.fi/language-technology/study-info/
• https://blogs.helsinki.fi/language-technology/study/courses-2019-2020/

Projects we run:
• https://blogs.helsinki.fi/language-technology/hi-nlp/crosslingual/
• https://blogs.helsinki.fi/language-technology/resources/
• https://blogs.helsinki.fi/language-technology/project-ideas/
THESIS TOPICS
PROFESSOR SASU TARKOMA
Example: Carat Dataset August 2018
● Originated in UC Berkeley, in collaboration with University of Helsinki
● Mobile app for Android and iOS
● Currently over 850 000 users
● >2.5 TB of data, > 250 million measurements
● Research project with many directions
● Publications and open datasets (Open Source clients): http://carat.cs.helsinki.fi
● Carat is the first system to use the device community to detect and correct energy problems
● Our method for diagnosing energy anomalies uses the community to infer a specification (expected energy use), and we call deviation from that inferred specification an anomaly
● Many awards including Mark Weiser Best Paper Award 2015.
● A. J. Oliner, A. P. Iyer, I. Stoica, E. Lagerspetz, S. Tarkoma. Carat: Collaborative energy diagnosis for mobile devices, In Proceedings of ACM SenSys ‘13.
WHAT IS MEGASENSE?
5G REAL-TIME MASSIVE-SCALE ENVIRONMENTAL SENSING WITH IOT AND AI
• Multidisciplinary research program
• Scalable and intelligent real-time air pollution monitoring solutions
• Hierarchical architecture with new low-cost sensors
• Leverages low-cost air pollution sensors, ML/AI, and versatile connectivity provided by 4G/5G
Nokia @ Espoo
[email protected]
Innovative and International HQ in Espoo
~3 000 employees 50+ nationalities
• Nokia headquarters
• All business units
• Most sustainable city in EU
• Strong ecosystem
• Most innovative community in the world

© 2019 Nokia
Network Technology Company – full coverage
Main Technologies: Mobile Networks (5G), Cloud Computing, AI/ML, IoT, Digital Service Provider Software, Drones….

+ Nokia Bell Labs

Data Science work
• Business Process automation
• Improved data and models driven decisions
• Algorithm research
• AI/ML in products and services

Career at Nokia
All Nokia Internship and Thesis Worker jobs are advertised on LinkedIn Jobs: https://www.linkedin.com/jobs/ Finland Student Opportunities: https://careers.nokia.com/page/finland-student-opportunities-347

For Summer 2020, the openings are posted in Jan-Feb 2020, but there are also jobs open "all the time".