Internship Programme 2017 Project Descriptions

The Institute Internship Programme 2017

Contents

Project 1 - Privacy-Preserving Hypothesis Testing
Project 2 - Remote Identification of System Failure in “Sensor Poor” Environments
Project 3 – Task Based Information Retrieval
Project 4 - Improving Renewable Energy Modelling & Forecasting
Project 5 - Exploiting Multi-modal Mobile Data Sources for Mental Health Monitoring
Project 6 - Distributed Acoustic Sensing for Oil/Water Flow Rates
Project 7 - The Extension of ESig Python Package for Mining Sequential Data
Project 8 – London Explained: Interactive Tools for Understanding Open Data
Project 9 - Does Beauty Pay Off Online? Returns to Subjective Attractiveness in Online Freelancing Labour Markets

Project 1 - Privacy-Preserving Hypothesis Testing

Project Goal
The goal of this project is to:
a. implement and evaluate a baseline privacy-preserving protocol for common test statistics using existing frameworks for secure computation,
b. design and implement optimized custom protocols, and
c. evaluate them on real-world datasets.

Project Supervisors
David Aspinall (Faculty Fellow, The Alan Turing Institute)
Adria Gascon (Research Fellow, The Alan Turing Institute)
Joshua Loftus (Research Fellow, The Alan Turing Institute)

Project Description
Consider a scenario in which two organizations wish to cooperate to run a data analysis algorithm on the union of their datasets. Their interests conflict, however, since each also wants to avoid disclosing any unnecessary information. For example, two medical research organizations might be interested in confirming the relationship between a genetic variation and a specific rare disease. While joining their data would benefit the study, regulatory or ethical constraints might prevent them from doing so.

In this project, we will take a secure multi-party computation (MPC) approach to this problem, and focus on hypothesis testing as the data analysis task. The goal is to design and implement distributed protocols that enable organizations to jointly perform a statistical testing analysis with formal guarantees regarding information disclosure.
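As a rough illustration of the MPC idea (not the protocol this project will build), the sketch below uses additive secret sharing so that several parties can learn the sum of their private counts without revealing the individual values. The modulus, the function names and the example counts are all assumptions made for illustration, and everything runs in a single process here; a real protocol would also need secure channels between the parties and a defined threat model.

```python
import secrets

# Public modulus (an assumed choice): all shares are integers mod PRIME,
# and private inputs must be non-negative integers smaller than PRIME.
PRIME = 2**61 - 1


def share(value, n_parties=2):
    """Split an integer into n additive shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    # The last share is fixed so that all shares add up to the secret.
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine additive shares into the value they encode."""
    return sum(shares) % PRIME


def secure_sum(private_values):
    """Simulate a joint sum in which no single share reveals an input.

    Party i splits its input into one share per party. Party j then adds
    up the j-th shares it received, and only these partial sums are
    combined to obtain the total.
    """
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]
    partial_sums = [sum(s[j] for s in all_shares) % PRIME for j in range(n)]
    return reconstruct(partial_sums)


# Hypothetical example: two organizations hold 12 and 30 cases respectively.
total_cases = secure_sum([12, 30])
```

A pooled sample mean, a common ingredient of test statistics, would follow from two such sums: one over the observed values and one over the sample sizes.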

Number of Students on Project: 1 - 2

Internship Person Specification

Essential Skills and Knowledge
• Excellent programming skills
• Good networking knowledge
• Basic cryptography understanding (public and symmetric key encryption)
• Basic understanding of statistics
• An interest in theoretical aspects of computer science

Desirable Skills and Knowledge
• Experience in implementing distributed systems and/or cryptographic protocols
• Knowledge of cryptographic approaches to privacy preservation and/or multiparty computation
• Either applied or theoretical knowledge of statistics, in particular statistical hypothesis testing

Project 2 - Remote Identification of System Failure in “Sensor Poor” Environments

Project Goal
Statistical benchmarking for the dynamic performance of various engineering subsystems, allowing the identification of failures within “sensor poor” environments.

Project Supervisors
Tony Latimer (Siemens)
Catalina Vallejos (Research Fellow, The Alan Turing Institute, University College London)

Project Description
Gas Turbine “Packages” incorporate a significant number of subsystems required for engine operation. The ideal operating parameters of these engines vary considerably between installations due to various factors (e.g. environmental). The purpose of the project is to develop statistical methodology to provide personalized dynamic benchmarking of individual engines, allowing the identification of faults within the system in environments with inconsistent and varying data availability. This will improve, and reduce the cost of, service delivery to Siemens customers.

This project is a unique opportunity for a motivated student who wants exposure to both academic and business environments. While most of the project will be carried out at the Turing Institute, the internship will involve close collaboration with the Siemens engineering team, including three weeks of on-site work in Lincoln.

Number of Students on Project: 1

Internship Person Specification

Essential Skills and Knowledge
• Excellent knowledge of statistics and/or machine learning
• Good programming skills in R (and possibly other languages such as C++)

Desirable Skills and Knowledge
• Basic knowledge or interest in engineering
• Experience developing R packages would be beneficial although training can be provided.

Project 3 – Task Based Information Retrieval

This project is split into two sub-projects:

3.a - Design of Task Based Information Retrieval Systems

Project Goal
The need for search often arises from the need for completing a real task. The goal of this project is to design new search systems that can understand the task the user is trying to achieve, and present them with all the subtasks they need to complete to achieve this.

Project Supervisors
Emine Yilmaz (Faculty Fellow, The Alan Turing Institute, University College London)
Maria Liakata (Faculty Fellow, The Alan Turing Institute, University of Warwick)

Project Description
The need for search often arises from a person’s need to achieve a goal or complete a task, such as booking travel or buying a house. Contemporary search engines focus on retrieving documents relevant to the submitted query, rather than on understanding and supporting the underlying information need (or task) that led the person to submit it. As a result, search engine users often have to submit multiple queries to satisfy a single information need. The goal of this project is to devise new retrieval systems that are task based: given a query, they can understand what the user is trying to achieve and present them with all the subtasks that need to be completed. In contrast to existing search interfaces, which present the user with a fixed interface for all search queries, this would involve designing adaptive search interfaces that could differ for each query, depending on the structure of the subtasks that need to be completed for the task the user is trying to achieve.

Number of Students on Project: 1

Internship Person Specification

Essential Skills and Knowledge
• Human computer interaction
• Information retrieval

Desirable Skills and Knowledge
• Probability and statistics
• Machine learning
• Data mining

3.b - Design of Task Based Conversational Agents

Project Goal
Conversational agents and intelligent assistants are commonly used by end users to complete real-world tasks. The goal of this project is to design new conversational agents that can understand the task the user is trying to achieve and guide the user, in a conversational manner, through the completion of that task, by utilizing information about all the subtasks that need to be completed for it.

Project Supervisors
Emine Yilmaz (Faculty Fellow, The Alan Turing Institute, University College London)
Maria Liakata (Faculty Fellow, The Alan Turing Institute, University of Warwick)

Project Description
Conversational agents and intelligent assistants are commonly used by end users to complete real-world tasks. The goal of this project is to design new conversational agents that can understand the task the user is trying to achieve and guide the user, in a conversational manner, through the completion of that task, by utilizing information about all the subtasks that need to be completed for it. For this purpose, we will start by using conversational agents that can utilise representations of tasks and subtasks that are already available in a semi-structured form and, if time permits, move on to representations of tasks that are automatically extracted from query logs.

Number of Students on Project: 1

Internship Person Specification

Essential Skills and Knowledge
• Natural language processing
• Information retrieval

Desirable Skills and Knowledge
• Probability and statistics
• Machine learning
• Data mining

Project 4 - Improving Renewable Energy Modelling & Forecasting

Project Goal
To provide new / improved models for wind and photovoltaic (PV) generation and, using these models, to develop new / improved means to forecast renewable generation output in Great Britain. This will help National Grid to maximize the contribution of renewable energy to the power system.

Project Supervisors
Jeremy Caplin (National Grid)
Martin Bradley (National Grid)
Sebastian Vollmer (Faculty Fellow, The Alan Turing Institute, University of Warwick)
Arnaud Doucet (Faculty Fellow, The Alan Turing Institute, Oxford University)
Kostantinos Zygalakis (Faculty Fellow, The Alan Turing Institute, University of Edinburgh)
Andrew Duncan (University of Sussex)

Project Description
Effective forecasting of renewable generation output is vital to keeping the lights on and to maximizing the contribution of renewables. National Grid has recently acquired two large datasets covering several years’ output from PV installations and small wind farms, and we want to analyse this data to improve our forecasting for PV and embedded wind power. The key objective is to develop new models that forecast regional output from a range of meteorological data and that can be integrated into our operations. We also want to predict any changes in the operation of traditional generation that are needed to accommodate increased volumes of renewables.

Number of Students on Project: 2-3

Internship Person Specification

Essential Skills and Knowledge
• Strong skills in statistical analysis of large time-series datasets
• Ability to turn the results of this analysis into useable mathematical models (including assessment of risk and quantification of uncertainty)

Desirable Skills and Knowledge
• The ability to develop adaptive models would be an advantage, avoiding the need for periodic manual recalculation of models.
• Familiarity with weather forecasting and / or renewable generation
• Experience of image processing, in particular processing of satellite imagery would be desirable.

Project 5 - Exploiting Multi-modal Mobile Data Sources for Mental Health Monitoring

Project Goal
The goal of this project is to devise novel machine learning approaches based on multi-modal mobile data sources describing human behaviour (such as sensor information and user generated data) for mood monitoring and prediction.

Project Supervisors
Abhinav Mehrotra (The Alan Turing Institute, University College London)
Mirco Musolesi (Faculty Fellow, The Alan Turing Institute, University College London)
Maria Liakata (Faculty Fellow, The Alan Turing Institute, University of Warwick)
Maria Wolters (Faculty Fellow, The Alan Turing Institute, University of Edinburgh)
James Cheshire (University College London)
Richard Dobson (University College London, Farr Institute)
Chiara Garattini (Intel)

Project Description
The goal of this exciting project is to build tools to exploit multi-modal mobile data sources for predictive monitoring of mental health conditions. Examples of datasets include GPS and phone interaction information (the data will be fully anonymised). A key component of the project will be designing and implementing a library for visualisation and online data analysis. Investigating the best ways of communicating the results of the analysis back to the users is another important aspect of this project.

Number of Students on Project: 1 – 4

Internship Person Specification

Essential Skills and Knowledge
• Basic statistical skills.
• Good programming skills (ideally in Python and/or R).

Desirable Skills and Knowledge
• Basic knowledge of machine learning principles

Project 6 - Distributed Acoustic Sensing for Oil/Water Flow Rates

Project Goal
Develop and implement machine learning/computational statistics algorithms that predict flow rates from acoustic signals.

Project Supervisors
Franz Király (Faculty Fellow, The Alan Turing Institute, University College London)
Sebastian Vollmer (Faculty Fellow, The Alan Turing Institute, University of Warwick)
Tim Park (Shell)

Project Description
In the oil and gas industry we often face the problem of monitoring what our wells are producing. Distributed acoustic sensing (DAS) is a new fibre optic measurement technology which can be mounted on the outside of a pipe and measures the sound produced by flowing liquid. This internship will involve taking these acoustic signals and linking them to the flow rates of the fluid and the ratio between oil and water. A particular data science challenge here is dealing with samples of multiple signals and with an experimental set-up in which flow rates and engineering parameters are varied in controlled (non-random, non-systematic) ways. This will necessitate exploring (and possibly creating) techniques for feature extraction and supervised learning with sequences such as acoustic signals, as well as methods for transfer learning and/or anomaly detection.

Number of Students on Project: 1 – 2

Internship Person Specification

Essential Skills and Knowledge
• Knowledge of advanced regression models
• Knowledge of model comparison
• Good programming skills (ideally Matlab, Python and R)
• Enthusiasm for applied problems and a practical mind set

Desirable Skills and Knowledge
• Experience with preventing overfitting and working with ‘small data’ (acoustic signal recorded for a long time (12h) but only a few different conditions)
• Experience with acoustic signals
• Experience with hierarchical models, experimental design or otherwise to reduce calibration cost
• Knowledge of computational fluid dynamics

Project 7 - The Extension of ESig Python Package for Mining Sequential Data

Project Goal
We have tested C++ software packages for capturing and transforming streamed multi-modal data into “signatures” and for mining that information. The C++ software can be used to capture actions in video and to recognise handwriting, and it achieves state-of-the-art results. This project aims to broaden the range of Python wrappers for the methodology, and to create robust, reusable Python packages that demonstrate some of the methodology in rich example contexts and extend its accessibility.

Project Supervisors
Terry Lyons (Faculty Fellow, The Alan Turing Institute)
Hao Ni (Faculty Fellow, The Alan Turing Institute, University College London)

Project Description
The mathematics needed to describe complex multimodal data streams has generated new tools; there are sophisticated C++ libraries that translate streamed data into highly informative feature sets. Lyons wrapped some of this library functionality into a multi-platform Python wrapper (Esig-0.6.4), which dramatically increased dissemination and progress. C++ implementations of these mathematical tools are now used in state-of-the-art action recognition and handwriting recognition. This project will extend the Python interface at a professional level of quality, and provide simple model/test examples using the interface, e.g. classifying actions in movies or identifying changes of mood in speech data.
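To give a feel for what a “signature” feature set contains, the following sketch computes the first two signature levels of a piecewise-linear path in plain Python. It does not use or reproduce the esig interface; the function name and the example path are hypothetical, and the update rule is the standard Chen identity for appending a straight-line segment to a path.

```python
def signature_level2(path):
    """Return the level-1 and level-2 signature terms of a piecewise-linear path.

    `path` is a sequence of points, each a sequence of floats of equal length.
    Level 1 is the total increment; level 2 collects the iterated integrals
    S[i][j] of coordinate i against coordinate j.
    """
    dim = len(path[0])
    s1 = [0.0] * dim
    s2 = [[0.0] * dim for _ in range(dim)]
    for prev, curr in zip(path, path[1:]):
        d = [c - p for p, c in zip(prev, curr)]  # increment of this segment
        # Chen's identity: cross term with the path so far, plus the
        # level-2 signature of the straight segment itself (d_i * d_j / 2).
        for i in range(dim):
            for j in range(dim):
                s2[i][j] += s1[i] * d[j] + 0.5 * d[i] * d[j]
        # Update level 1 only after level 2 has used the old value.
        for i in range(dim):
            s1[i] += d[i]
    return s1, s2


# Hypothetical example: move one unit right, then one unit up.
level1, level2 = signature_level2([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
# level1 == [1.0, 1.0]
# level2 == [[0.5, 1.0], [0.0, 0.5]]
```

The two off-diagonal entries of level 2 differ because the path moves right before it moves up; it is this sensitivity to the order of events that makes signature terms useful as features for sequential data.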

Number of Students on Project: 1 – 3

Internship Person Specification

Essential Skills and Knowledge
• Excellent coding skills (in C++ and ideally Python)
• Ability to organise information and create documentation, tests and examples
• Ability to use source control and put the Python packages on the web in standardised format
• An enthusiasm for turning abstract science into practical outcomes

Desirable Skills and Knowledge
• Knowledge of development, packaging and version control
• Experience of Windows platforms

Project 8 – London Explained: Interactive Tools for Understanding Open Data

Project Goal
What would it take to get the public interested in statistics and to believe in data analysis? The goal of this project is to build web-based interactive visualizations about London that encourage readers to check their assumptions, make their own predictions and see how reality matches their beliefs.

Project Supervisors
Tomas Petricek (Research Fellow, The Alan Turing Institute)
Brooks Paige (Research Fellow, The Alan Turing Institute, University of Oxford)
Maria Wolters (Faculty Fellow, The Alan Turing Institute, University of Edinburgh)
Rachel Oldroyd (The Bureau of Investigative Journalism)
James Geddes (Research Engineer, The Alan Turing Institute)

Project Description
Has statistics really lost its power [1], and are facts backed by data a thing of the past [2]? We do not think so! In this project, we will build a new and more engaging way of presenting data and statistical models that encourages readers to actively explore and understand the data. As in the recent New York Times article on Obama’s legacy [3], readers will be able to make their own guesses about the world and causation before seeing how those guesses fit the available data. In the first week of the internship, we will work closely with The Bureau of Investigative Journalism to identify a high-impact data set and problem domain to work with. As an example starting point, we could build a visualization that models how different factors influence the demand for school places in London. Is population growth the only factor, or do immigration and the economic situation in London make a big difference?

[1] https://www.theguardian.com/politics/2017/jan/19/crisis-of-statistics-big-data-democracy
[2] https://www.theguardian.com/books/2016/nov/15/post-truth-named-word-of-the-year-by-oxford-dictionaries
[3] https://www.nytimes.com/interactive/2017/01/15/us/politics/you-draw-obama-legacy.html

Number of Students on Project: 1 – 3

Internship Person Specification

Essential Skills and Knowledge
We are seeking a diverse team of students with complementary skills. You will need one (or more) of the following skills:
• Data modelling (probabilistic programming, machine learning or statistical modelling)
• Data journalism and data science (Python, R, F#, etc.)
• Data visualization and web programming (JavaScript, D3, etc.)

Desirable Skills and Knowledge
• Experience with building end-to-end data analyses
• Human-computer interaction, empirical evaluation and design skills
• Excellent programming skills (in data science or software development)
• Experience with building interactive data visualizations

Project 9 - Does Beauty Pay Off Online? Returns to Subjective Attractiveness in Online Freelancing Labour Markets

Project Goal
Build new knowledge on the functioning of online labour markets, resulting in a publication.

Project Supervisors
Prof Vili Lehdonvirta (Faculty Fellow, The Alan Turing Institute, University of Oxford)
Dr Otto Kässi (Researcher, University of Oxford)
Dong Nguyen (Research Fellow, The Alan Turing Institute, University of Edinburgh)

Project Description
There is considerable evidence that good looks and wages are positively correlated. This observation holds across occupations and countries. In this project, we examine whether the same trend persists when labour is transacted fully digitally, using data obtained via API access from a major online labour market, combined with subjective assessments obtained from an online panel survey. The results of the project will help us better understand how and why subjective evaluations of attractiveness affect labour market outcomes, helping hiring managers to make better decisions and platform designers to construct more level marketplaces.

Number of Students on Project: 1

Internship Person Specification

Essential Skills and Knowledge
• Knowledge of regression analysis
• Strong skills in R

Desirable Skills and Knowledge
• Python skills
• Experience of Git
• Interest in social data science
