Insight Centre for Data Analytics
Dublin, Ireland, September 12, 2014

Book of Abstracts, Insight Student Conference

INSIGHT-SC was organised by Aisling Connolly, Bianca Pereira, Erika Duriakova Insight Centre for Data Analytics

Book of Abstracts Insight Student Conference (INSIGHT-SC 2014) Dublin, Ireland

Aisling Connolly, Bianca Pereira, Erika Duriakova Insight Centre for Data Analytics

September 12, 2014 Published by: Insight Centre for Data Analytics https://www.insight-centre.org/content/insight-student-conference-2014

Credits:
Cover design: Emir Muñoz
Book editor: Bianca Pereira
LaTeX editor: Emir Muñoz
LaTeX templates for abstracts: Igor Brigadir, using LaTeX's 'confproc' package, version 0.8 (by V. Verfaille)

ANNUAL INSIGHT STUDENT CONFERENCE - VOL. 1: AN INTRODUCTION

Ais Connolly, Insight Centre for Data Analytics, UCD, [email protected]
Bianca Pereira, Insight Centre for Data Analytics, NUIG, [email protected]
Erika Duriakova, Insight Centre for Data Analytics, UCD, [email protected]

Abstract
In this paper we attempt to outline the procedure for organising the Insight student conference. There were three students given the task, two from UCD, one from NUIG. If you are reading this, all must have gone according to plan and everything got to print on time [1]. Initial findings suggest that the three have raised the number of Insight student conferences from previous years by more than 87% [2].

1. Motivation
With over 200 researchers spread across 8 institutions it's hard to keep track of us all. We live and work in a fast-paced, ever-changing environment where cutting edge technology meets daily life. We are creating a data driven society, but how are we doing it? In order for the centre as a whole to grow and flourish, knowledge, Insights even, must be gained through conversation, collaboration and contact between the researchers and the institutions. After all, "The whole is greater than the sum of its parts."

2. Problem Statement and Proposed Solution
It is hypothesised that 84.5% of student level research is dark matter. Under normal circumstances it is invisible to the naked eye and we infer its existence only from its effect on much bigger objects, e.g. PI grants. We have designed a framework to shine a light upon this research by converging it in a restricted space-time frame and providing it with a big stage.

3. Hypothesis
Three PhD students with little to no event or conference organisation experience can totally pull this off. Not only do we expect to finally see student research, but in accordance with the 'observer effect' we think this work will have a huge impact on the future of research within Insight and probably even spark spontaneous collaborations.

4. Method
The following schematic explains our approach to the problem.
1).. ASK Donnacha for HELP
2).. IF NOT Donnacha GOTO Padraig
3).. IF MONEY < 0 THEN ASK (Hugh OR Declan)
4).. IF UCC==EXCEPTION CALL Chrys
5).. ELSE GOTO 1)

5. Evaluation and Results
Difficult and all as it was to organise the event, we found that the 93 willing test subjects [3] endured the process much better than anticipated and produced an output of consistent high value research [4]. Although, audio analysis conducted shows an exponential increase in sighing throughout the Insight offices, reaching its peak at preset deadlines. Several extensions have been facilitated to prove this correlation within a 92% confidence interval.

6. Conclusion
It was all worth it in the end... right?!
On a more serious note. We have found that it takes countless cups of coffee, dozens of Skype meetings, a plethora of questions, answers, marginally more agreements than disagreements and a lot of help to organise an event on this scale (I guess this could be extended to the research domain as a whole), but in saying that, we gained a sense of the size of this centre, the importance of the work being carried out here and, most importantly, the volume and quality of the students within the centre. The breadth of knowledge and the level of understanding that we have encountered has left us enamoured. We hope this book of abstracts will leave you feeling the same way.
It cannot go without saying that we are part of a great scientific venture. We, for three, feel privileged to have had the opportunity to bring everyone together to experience the discovery that has occurred within the first year of Insight. Long may this discovery continue.
Thank you.

7. Acknowledgements
This work would not have been possible without the help of the many volunteer postdoc reviewers who vetted the mountain of abstracts. We present their work as ours in Appendix B [5].

[1] By on time, we mean... eventually
[2] A bold claim, we know.
[3] Authors, included in this book
[4] Excluding the usual outlier of the 'my-dog-ate-my-harddrive' type
[5] This is a joke, it's their work really


Program Committee

• Achille Zappa
• Aisling Connolly
• Alejandro Arbelaez
• Amin Ahmadi
• Aonghus Lawlor
• Aymen Benazouz
• Bianca Pereira
• Carlo Manna
• Daniel Kelly
• David Monaghan
• Deepak Mehta
• Diarmuid Grimes
• Erika Duriakova
• François Destelle
• Georgeta Bordea
• Gilles Simonin
• Graham Healy
• Gregor Schiele
• Helmut Simonis
• Kevin McGuinness
• Larisa Florea
• Lars Kotthoff
• Laura Climent
• Leonardo Gualano
• Marius Kaminskas
• Mustafa Al-Bado
• Oscar Manzano Torre
• Oya Beyan
• Rachael Rafter
• Ratnesh Sahay
• Suzanne Little


LIST OF ABSTRACTS

Student Track

1   Adaptive Stream Query Processing Framework for Linked Stream Data
    Zia Ush Shamszaman, Muhammad Intizar Ali, Alessandra Mileo
2   Adaptive Worker Assignment to Spatial Crowdsourcing Tasks
    Umair Ul Hassan, Edward Curry
3   An Analysis Framework for Content-based Job Recommendation
    Xingsheng Guo, Houssem Jerbi, Michael P. O'Mahony
4   Analysis of the Semi-Synchronous Approach to Large-Scale Parallel Community Finding
    Erika Duriakova, Neil Hurley, Deepak Ajwani, Alessandra Sala
5   Asymmetry in Athletic Groin Pain Patients and Elite Rugby Union Players using Analysis of Characterising Phases
    Shane Gore, Chris Richter, Brendan Marshall, Andrew Franklyn-Miller, Eanna Falvey, Kieran Moran
6   Automatic Classification of Knee Osteoarthritis Images with Reference to Kellegren-Lawrence Radiographic Scores
    Joseph Antony, Kevin McGuinness, Noel E. O'Connor, Kieran Moran
7   Automatic Complex Event Implementation System
    Feng Gao, Edward Curry
8   Bayesian Inference for Exponential Random Graph Models Using Composite Likelihoods
    Lampros Bouranis, Nial Friel, Florian Maire
9   Bayesian Low-rank Matrix Completion: General Sampling Distribution and Optimal Rate
    The Thien Mai, Pierre Alquier
10  A Bid Profiling Tool for Facebook Advertising Campaigns
    Ewa Mlynarska, Derek Greene, Pádraig Cunningham
11  Building a User Profile from Automatic Human Activity Recognition in Lifelogging, for a Personal Recommender System
    Stefan Terziyski, Rami Albatal, Cathal Gurrin
12  Cardiorespiratory Fitness and Vascular Function in Youth
    Sinead E. Sheridan, Niall M. Moyna
13  The Classification and Evaluation of Exercise Technique using Inertial Sensors and Depth Cameras
    Martin O'Reilly, Darragh Whelan, Tomás Ward, Brian Caulfield
14  Considering Uncertainty in Data - A Practical Example from the Forestry Industry
    Adejuyigbe O. Fajemisin
15  Data Analytics for Assessing Financial Incentives
    Yulia Malitskaia, Barry O'Sullivan
16  A Data Driven Approach to Determining the Influence of Fatigue on Movement Variability and Associated Injury Risk in Chronic Ankle Instability
    Alexandria Remus, Eamonn Delahunt, Kevin Sweeney, Brian Caulfield
17  DWARF Compression and NoSQL: The Future of XML Analytics
    Michael Scriney, Mark Roantree
18  Dynamic User Authentication Based on Mouse Movements Curves
    Zaher Hinbarji, Rami Albatal, Cathal Gurrin


19  Energy Efficiency in Smart Homes
    Oscar Manzano Torre
20  Entity Linking with Multiple Knowledge Bases
    Bianca Pereira
21  Every Aspect Counts: A Combined Approach Towards Aspect Based Sentiment Analysis
    Sapna Negi
22  Ethics of Ambient Assisted Living Technologies for People with Dementia
    Peter Norvitzky, Alan F. Smeaton, Cynthia Chen, Kate Irving, Tim Jacquemard, Fiachra O'Brolcháin, Dónal O'Mathúna, Bert Gordijn
23  A Feature-based Analysis for Music Mood and Genre Classification
    Humberto Jesús Corona Pampín, Michael P. O'Mahony
24  Federated Autonomic Trust Management
    Samane Abdi
25  Finding Bounded Disjoint Paths in a Very Large Spatial Data for Optical Networks
    Ata Sasmaz
26  Further Experiments in Sentimental Product Recommendation
    Ruihai Dong, Michael P. O'Mahony, Barry Smyth
27  Improving Government Policy & Decision Making with Social Data
    Lukasz Porwol
28  Insight4News: Connecting News to Relevant Social Conversations
    Bichen Shi, Georgiana Ifrim, Neil Hurley
29  Integration of Terminology into the CAT Environment
    Mihael Arcan
30  A Latent Space Analysis of User Lifecycles in Online Communities
    Xiangju Qin, Derek Greene, Pádraig Cunningham
31  Learning with Subsets of the Data
    Aidan Boland, Nial Friel
32  Low Cost Autonomous Sensing Platforms for Water Quality Monitoring
    Deirdre Cogan, John Cleary, Cormac Fay, Dermot Diamond
33  Machine Learning in Portfolio Solvers
    John Horan
34  Maintenance Scheduling through Degradation Monitoring for Building Service Components
    Ena Tobin
35  Making Value Out of Lifelogs
    Gunjan Kumar, Houssem Jerbi, Michael P. O'Mahony
36  Managing the Trade-Off Between Response Time and Quality in Hybrid SPARQL Query Processing Based on Response Requirements
    Soheila Dehghanzadeh
37  Mining Opinions from User-Generated Reviews for Recommender Systems
    Khalil Muhammad, Aonghus Lawlor, Rachael Rafter, Barry Smyth
38  Multi-Modal Continuous Human Affect Recognition
    Haolin Wei, David S. Monaghan, Noel E. O'Connor
39  Negative FaceBlurring: A Privacy-by-Design Approach to Visual Lifelogging with Google Glass
    Tengqi Ye, Brian Moynagh, Rami Albatal, Cathal Gurrin
40  Non Invasive Detection of Biological Fluids
    Giusy Matzeu, Conor O'Quigley, Eoghan McNamara, Cormac Fay, Dermot Diamond


41  Object Segmentation in Images Using EEG Signals
    Eva Mohedano, Graham Healy, Kevin McGuinness, Xavier Giró-i-Nieto, Noel E. O'Connor, Alan F. Smeaton
42  Occupant Location Prediction Using Association Rules
    Conor Ryan
43  Periodicity Detection in Lifelog Data
    Feiyan Hu, Alan F. Smeaton, Eamonn Newman
44  Proactive Workload Consolidation for Reducing Energy Costs
    Milan De Cauwer
45  Probabilistic Analysis of Latent Space Models for Social Networks
    Riccardo Rastelli, Nial Friel, Adrian Raftery
46  Quality on the Web of Data
    Emir Muñoz

47  Random Manhattan Indexing: A Randomized Scalable Method for Semantic Similarity Measurement in L1 Normed Spaces
    Behrang Q. Zadeh
48  Real-time Algorithm Configuration
    Tadhg Fitzgerald
49  Realtime Keyword Specific Mining of Social Media
    Daniel Merrick
50  Real-Time Predictive Analytics Using Open Data and the Web of Things
    Wassim Derguech, Eanna Burke, Edward Curry
51  Reversible Photo-Actuated Hydrogels for Micro-Valve Applications
    Aishling Dunne, Larisa Florea, Dermot Diamond
52  Self-Propelled Ionic Liquid Droplets
    Wayne Francis, Larisa Florea, Dermot Diamond
53  Social Media Feeds Clustering in Heterogeneous Information Networks
    Narumol Prangnawarat, Hugo Hromic, Ioana Hulpus, Conor Hayes
54  Thematic Event Processing for Large Scale Internet of Things
    Souleiman Hasan
55  Tracking and Recommending News
    Doychin Doychev
56  Tracking Breaking News on Twitter with Neural Network Language Models
    Igor Brigadir
57  Ultimate Search: Visual Object Retrieval
    Zhenxing Zhang, Cathal Gurrin, Alan F. Smeaton
58  Use of Inertial Sensors and Depth Cameras to Classify Performance Using Functional Screening Tools
    Darragh Whelan, Martin O'Reilly, Eamonn Delahunt, Brian Caulfield
59  Using Third Level Educational Data to Help "At Risk" Students
    Owen Corrigan
60  Variable Selection for Categorical Data Clustering
    Michael Fop, Thomas Brendan Murphy

Intern Track

61  Deep Learning for High-Dimensional and Sparse Clinical Study Data
    Jim O'Donoghue, Mark Roantree
62  Development of an Analytics Framework for User-Generated Multi-Lingual Data
    John Lonican, Brian Davis, Siegfried Handschuh, Conor Hayes


63  Diversity-Aware Top-N Recommender Systems
    Andrea Barraza-Urbina, Conor Hayes
64  Educational Data Analytics
    John Brennan
65  A Keyword Sense Disambiguation Based Approach for Noise Filtering in Twitter
    Sanjaya Wijeratne, Bahareh R. Heravi
66  Low Cost Motion Capture Integrated into Home Based Serious Games
    Andrew Daly, David S. Monaghan
67  Measuring Scientific Impacts of Research through Altmetrics
    Mohan Timilsina, Vaclav Belak, Conor Hayes
68  Modeling Causality for Big Data
    Dalila Messedi
69  Multi-Agent Decision Making Over Wireless Network
    Martin Bullman
70  Multimodal Human Motion Capture and Synthesis
    Marc Gowing, David S. Monaghan, Noel E. O'Connor
71  Obesity, Vaping & Referendums - Opinion Mining & Social Media Analysis
    Ayokunle Adeosun
72  Predicting Retweets
    Daniel-Emanuel Lal
73  A Scalable Adaptive Method for Complex Reasoning over Streams
    Thu-Le Pham, Alessandra Mileo
74  The Stable Marriage Problem
    Aodhagán Murphy
75  Towards Automatic Activity Classification During a Sports Training Session
    Edmond Mitchell, Noel E. O'Connor
76  Visualizing Geographic Ride-Sharing Data
    Yves Sohege

77 List of Keywords


Adaptive Stream Query Processing Approach for Linked Stream Data

Zia Ush Shamszaman, Muhammad Intizar Ali, Alessandra Mileo
National University of Ireland, Galway
{zia.shamszaman, ali.intizar, alessandra.mileo}@insight-centre.org

1. Motivation and Problem Statement
Over the last few years several stream query processors have been proposed for efficient processing of Linked Stream Data (LSD). LSD are generated through numerous modalities and different levels of granularity and refer to the streaming data published over the (Semantic) Web adhering to the Linked Data principles. Data in LSD arrive in continuous fashion, usually at short intervals with a time stamp, and the most recent data are more relevant for stream data processing. Multiple aspects can affect the performance and correctness of the results produced by the query processors, including operational semantics of linked streams, query execution method, and target domain. However, existing approaches lack the adaptability to changing requirements of the applications and properties of the underlying data streams. The goal of our research is to design a more flexible and adaptive stream query processing approach which enables existing stream query processing solutions to adapt according to the requirements of the applications and to the characteristics of the data streams. The adaptive nature of our approach will support efficient processing of larger amounts of data to serve a broader category of real-time applications efficiently. Numerous approaches [1, 2, 5] have been proposed based on SPARQL-like query languages to harvest LSD processing in Resource Description Framework (RDF) and related formats. While each existing approach has advantages, none of them wins in diverse settings. They differ on a wide range of aspects including the execution method, operational semantics, streaming operators and more.

2. State-of-the-art
Considering state-of-the-art solutions, recent evaluations by [3, 4, 6] show that C-SPARQL [1] has a clear advantage over others in terms of producing accurate results, but still occasionally suffers from duplicate results for simple queries and incomplete results for complex queries. This diversity in output results is true for other processors including EP-SPARQL, StreamingSPARQL and SPARQLstream. On the other hand, CQELS [5] provides better performance in terms of throughput and functionalities [4].

3. Research Question and Hypothesis
In this research, we investigate existing LSD processors to selectively combine their strengths based on the application requirements and data properties. Our main goal is to bridge the gap between LSD query processing and real world applications by creating an adaptive layer which allows to react to changing requirements for better performance and quality of results. We consider differences and similarities of existing engines at a granular level and investigate how data properties and application requirements affect those dimensions.

4. Proposed Approach and Discussion
We have devised an initial set of dimensions by performing detailed analysis of state-of-the-art engines and their fine-grained characteristics. These dimensions include query execution strategy, time model, abstract operators for the processing model, quality of service, log management and privacy requirements. As a next step, we plan to design a model and conceptual architecture based on our adaptive stream query processing approach. We believe new concepts and algorithms should be defined for the adaptive stream query processing approach while considering our devised initial set of dimensions.
Recent initiatives such as RDF Stream Processing (RSP) in the W3C community aim at bridging the gap between Linked Data and stream processing and provide a common standard for producing, transmitting and querying RDF streams. We intend to contribute to RSP in designing a standard model for LSD and make sure that it can support an adaptive stream query processing approach.

References
[1] D. F. Barbieri, D. Braga, S. Ceri, E. Della Valle, and M. Grossniklaus. C-SPARQL: SPARQL for continuous querying. In Proc. of the 18th World Wide Web Conf., pages 1061-1062, New York, NY, USA, 2009. ACM.
[2] A. Bolles, M. Grawunder, and J. Jacobi. Streaming SPARQL: extending SPARQL to process data streams. In Proc. of the 5th ESWC: Research and Applications, ESWC'08, pages 448-462, Berlin, Heidelberg, 2008. Springer-Verlag.
[3] D. Dell'Aglio, J.-P. Calbimonte, M. Balduini, O. Corcho, and E. D. Valle. On correctness in RDF stream processor benchmarking. In ISWC (2), volume 8219 of LNCS, pages 326-342. Springer, 2013.
[4] D. Le-Phuoc, M. Dao-Tran, M. Pham, P. Boncz, T. Eiter, and M. Fink. Linked stream data processing engines: Facts and figures. The Semantic Web - ISWC 2012, pages 300-312, 2012.
[5] D. Le-Phuoc, M. Dao-Tran, J. Xavier Parreira, and M. Hauswirth. A native and adaptive approach for unified processing of linked streams and linked data. The Semantic Web - ISWC 2011, pages 370-388, 2011.
[6] Y. Zhang, M.-D. Pham, Ó. Corcho, and J.-P. Calbimonte. SRBench: A streaming RDF/SPARQL benchmark. In ISWC (1), pages 641-657, 2012.


Adaptive Worker Assignment to Spatial Crowdsourcing Tasks

Umair ul Hassan, Edward Curry
Insight Centre for Data Analytics
{umair.ul.hassan, ed.curry}@insight-centre.org

Abstract
We introduce the "online worker assignment" problem that aims to assign a worker to each dynamically arriving spatial task. The objective of the assignment algorithm is to maximize the number of tasks accepted by assigned workers. The algorithm must trade off between exploring the behavior of workers and exploiting the known best workers. We propose and evaluate an algorithm based on worker preferences.

1 Motivation
Spatial crowdsourcing involves tasks that have an associated physical location. A spatial task might require the worker to travel to the associated location for performing it. Examples of spatial crowdsourcing include photo collection in a disaster hit area or traffic reporting in a large city.

2 Problem Statement
Intelligently matching tasks with the best workers is a fundamental issue of spatial crowdsourcing. The matching process can either follow a pull approach, by providing a suitable environment for workers to find tasks themselves, or a push approach that algorithmically matches tasks with workers. We focus our attention on a push approach that aims to maximize the completion of tasks by choosing appropriate workers for tasks.

3 Related Work
Existing assignment models focus on either inference of correct answers for tasks [1], maximization of successful assignments [3], or maximization of the total number of assignments [2]. However, these models do not consider worker preferences while making assignment decisions. To address this issue we propose the online worker assignment model.

4 Research Question
In our model, the crowdsourcing platform has a pool of registered workers. Spatial tasks are submitted to the platform one at a time. An appropriate worker must be assigned to each task. However, workers have preferences over the tasks they are willing to perform. The preferences are initially unknown to the platform, but can be learned over time by observing the outcome of assignments. The primary goal of the assignment algorithm is to choose a worker for each task such that the total number of tasks accepted by assigned workers is maximized over time. This poses an exploration-exploitation trade-off; the algorithm must learn worker behavior by sampling different workers, but should also assign tasks to the known best workers.

5 Hypothesis
By learning the worker preferences in terms of the spatial attributes of tasks and worker situation, we can design effective algorithms for choosing appropriate workers such that the number of accepted tasks is maximized over time.

6 Proposed Solution
We propose a framework that formulates online worker assignment as a multi-armed bandit problem. We propose the SpatialUCB algorithm, which chooses the worker with the highest upper confidence bound on the expectation of task acceptance. A confidence interval on the expectation of task acceptance is computed based on each worker's preferences. A worker's preferences are modeled as a linear function of the spatial attributes of the task and the worker situation.

7 Evaluation
We evaluate the performance of the SpatialUCB algorithm against two baseline algorithms on a real-world dataset. The ε-Greedy algorithm greedily assigns the best worker for 1-ε of tasks, and otherwise samples a random worker. In the Softmax algorithm a worker is chosen with a probability that is proportional to the observed acceptance rate of the worker. Figure 1 suggests that SpatialUCB performs best in terms of the total number of accepted tasks.

Figure 1: Comparison of algorithms on 26K spatial tasks.

References
[1] C.-J. Ho, S. Jabbari, and J. W. Vaughan. Adaptive task assignment for crowdsourced classification. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 534-542, 2013.
[2] L. Kazemi and C. Shahabi. GeoCrowd: enabling query answering with spatial crowdsourcing. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems, pages 189-198. ACM, 2012.
[3] A. Mehta and D. Panigrahi. Online matching with stochastic rewards. In IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 728-737. IEEE, 2012.
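As an illustration of the bandit formulation in Section 6, the following sketch shows a LinUCB-style selector in which each worker's acceptance probability is modelled as a linear function of task features and the worker with the highest upper confidence bound is chosen. The class name, parameters and feature encoding are illustrative assumptions, not the SpatialUCB implementation.

# Illustrative sketch: upper-confidence-bound worker selection with a linear
# preference model. Assumes each (task, worker) pair has a feature vector of
# spatial attributes and that acceptance feedback (0/1) is observed afterwards.
import numpy as np

class SpatialUCBSketch:
    def __init__(self, n_workers, n_features, alpha=1.0):
        self.alpha = alpha                                          # width of the confidence bonus
        self.A = [np.eye(n_features) for _ in range(n_workers)]     # per-worker design matrices
        self.b = [np.zeros(n_features) for _ in range(n_workers)]   # per-worker reward vectors

    def select(self, task_features):
        """task_features[w] = feature vector of the task in worker w's situation."""
        scores = []
        for w, x in enumerate(task_features):
            A_inv = np.linalg.inv(self.A[w])
            theta = A_inv @ self.b[w]                   # estimated preference weights
            ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(ucb)
        return int(np.argmax(scores))                   # worker with the highest upper bound

    def update(self, worker, x, accepted):
        """accepted is 1 if the assigned worker accepted the task, else 0."""
        self.A[worker] += np.outer(x, x)
        self.b[worker] += accepted * x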


An Analysis Framework for Content-based Job Recommendation Xingsheng Guo, Houssem Jerbi and Michael P. O’Mahony

University College Dublin {xingsheng.guo, houssem.jerbi, michael.omahony}@insight-centre.org

Abstract
This paper presents several personalised content-based and case-based approaches to job recommendation, investigating a number of feature-based item representations, along with a variety of feature weighting schemes. A comparative evaluation of the various approaches is performed using a real-world, open source dataset.

1. Motivation
Online recruitment websites allow job seekers to advertise their profiles and provide an opportunity for employers to advertise their job postings. Job seekers are usually confronted with a huge amount of job opportunities and it is common that a job seeker navigates through different online recruitment websites without finding a suitable job. Recommender systems can act as an electronic recruitment assistant and help job seekers explore jobs available online.

2. Research Question
Our preliminary studies focused on an in-depth exploration of various content-based and case-based approaches with different item representations and feature types for job recommendation.

3. Related Works
Recent research has focused on the job recommendation task. For example, in [1] jobs are recommended based on a graph constructed from the previously observed job transition patterns of users based on various features (e.g., employer sector and size, employee experience and education, etc.). A supervised machine learning approach was then adopted for job recommendation. In [2], a graph-based hybrid recommender is proposed which considers both content-based user and job profile similarities and interaction-based activities (e.g., applying to a job). Personalised recommendations of candidates and jobs are then generated using a PageRank-style ranking algorithm. Further, a collaborative filtering approach based on implicit profiling techniques was proposed in [3] to deliver personalised, query-less job recommendations to users. To date, research in the area of job recommendation has not considered an in-depth analysis of content-based nor case-based approaches.

4. Proposed Solution
Firstly, a number of content-based approaches to job recommendation are proposed. In particular, we consider variations on the unstructured item representation (bag-of-words (BoW), entities (EN) and social tags (ST)) used in traditional content-based recommenders, where jobs are represented in the Vector Space Model. Secondly, case-based approaches are considered, where structured job representations are proposed using a well-defined set of features (categories of entities) and feature values (entities), and similarities are assessed based on IR-like feature weighting schemes (DF and IDF). Finally, a hybrid approach to job representation is proposed, which integrates the known BoW feature along with a well-structured set of features (e.g., job location, required experience, education, etc.).

5. Evaluation
We conducted our experiments using a real-world publicly available dataset (https://www.kaggle.com/c/job-recommendation/data). Job features were extracted from the job title, description and requirements. For content-based approaches, different job representations using various feature types (BoW, EN and ST) were investigated. In particular, we employed TF-IDF cosine-based similarity for BoW features and Jaccard for EN and ST features. For case-based approaches, a number of variations were considered; for example, using categories of entities and explicit features (e.g. education level and job location). Our experiments demonstrated that a hybrid approach (combining both BoW and categories of entities, and weighted by the DF weighting scheme) performed best.

6. Future Works
We will extend our job recommendation framework by considering alternative job representations; for example, by representing jobs based on the features of applicants. An adapted domain-specific feature miner will also be considered. Moreover, we will consider collaborative-filtering style approaches for this task.

7. References
[1] I. Paparrizos, B. Cambazoglu, and A. Gionis. "Machine learned job recommendation", 5th ACM Conference on Recommender Systems, pp. 325-328, 2011.
[2] L. Yao, S. E. Helou, and D. Gillet. "A recommender system for job seeking and recruiting website", 22nd International Conference on World Wide Web Companion, pp. 963-966, 2013.
[3] R. Rafter, K. Bradley, and B. Smyth. "Automated collaborative filtering applications for online recruitment services", International Conference on Adaptive Hypermedia and Adaptive Web-based Systems, pp. 363-368, 2000.
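To make the two similarity measures named in Section 5 concrete, the sketch below computes TF-IDF cosine similarity over bag-of-words job texts and Jaccard similarity over entity sets. The toy job texts and entity sets are invented for illustration; this is not the authors' evaluation code.

# Illustrative sketch: TF-IDF cosine similarity for BoW job texts and Jaccard
# similarity for entity/tag sets, as used in the evaluation described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(a, b):
    """Jaccard similarity between two sets of entities or social tags."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

jobs = ["java developer dublin finance",
        "data analyst python dublin",
        "senior java engineer banking"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(jobs)                 # BoW representation in the vector space model
print(cosine_similarity(X[0], X[2]))          # TF-IDF cosine similarity between jobs 0 and 2

print(jaccard({"Java", "Finance", "Dublin"},  # entity-set similarity
              {"Java", "Banking"}))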


Analysis of the Semi-synchronous approach to Large-scale Parallel Community Finding Erika Duriakova, Neil Hurley Deepak Ajwani, Alessandra Sala University College Dublin Bell Laboratories, Dublin [email protected] [email protected] [email protected] [email protected]

Abstract
Community-finding in graphs is the process of identifying highly cohesive vertex subsets. In this work we show that the number of iterations and the quality of the resulting clusters produced by our parallel approach is invariant to architecture and systems layer issues.

1. Motivation
The problem of identifying communities in large and complex networks has received ample attention in recent years, owing to its application in the analysis of social networks, citation networks, the world-wide web, biological networks and so on.

2. Problem Statement
The most basic community-finding algorithm is the label propagation (LP) algorithm [2]. LP, in each iteration, assigns to a vertex the most frequent label among its neighbours and itself. When vertices are distributed over a parallel machine, some approaches to Label Propagation computation can encounter race conditions that impact on its performance and destroy the consistency of the results.

3. Related Work
State-of-the-art parallel and distributed implementations of LP algorithms have been previously reported. For instance, three approaches to the parallelisation of the LP algorithm for community finding on shared-memory architectures have been evaluated in [3]. Another distributed implementation of a constrained variant of the LP algorithm has been used in the graph partitioning system for Facebook's People You May Know (PYMK) service [4].

4. Research Question
In this paper, we demonstrate that, for social network graphs with highly non-uniform degree distributions, the vertex update ordering has a large impact on the run-time of the analysis and the quality of the resulting communities. It is thus a key parameter of the computation. This suggests that analysts wishing to explore the structural properties of the network should be able to control this update order, independently of the details of the machine on which the algorithm is run.

5. Hypothesis
Because of the heavy-tailed degree distribution, the order in which vertex updates are applied can greatly impact both convergence time and the quality of the found communities. We study the impact of ordering on the parallel LP algorithms implemented on a shared-memory multicore architecture. In an asynchronous LP approach, a vertex update uses the most recent state as generated by the previous update of vertices in its neighbourhood. In parallel settings, this approach can encounter race conditions that impact on its performance and destroy the consistency of the results.

6. Proposed Solution
A semi-synchronous approach to Label Propagation computation ensures that only non-conflicting vertices are updated simultaneously. We explore the use of independent sets to compromise between processor data independence and good mixing of community labels during the course of the algorithm. This so-called semi-synchronous approach was previously proposed in [1] but not fully implemented and evaluated on a parallel architecture.

7. Evaluation
We report results obtained using two social network graphs (Pokec and Orkut) and the UK 2002 network. We learned that the results on the social network graphs are highly volatile. On the other hand, the UK 2002 graph shows much more stability. When increasing degree order is used, generally there is significantly more work done in order for LP to converge, which leads to large run-times. The construction of independent sets in increasing degree order produces a higher number of independent sets and hence more synchronisation points for the semi-synchronous algorithm. Our results can be more generally useful for designing and engineering many other algorithms in the vertex-centric computation model, where the mixing rate of labels is an important concern.

References
[1] G. Cordasco and L. Gargano. Label propagation algorithm: a semi-synchronous approach. International Journal of Social Network Mining, 1(1):3-26, 2012.
[2] U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[3] C. L. Staudt and H. Meyerhenke. Engineering high-performance community detection heuristics for massive graphs. In Parallel Processing (ICPP), 2013 42nd International Conference on, pages 180-189. IEEE, 2013.
[4] J. Ugander and L. Backstrom. Balanced label propagation for partitioning massive graphs. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 507-516. ACM, 2013.
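For reference, the basic label propagation update described in Section 2 (each vertex adopts the most frequent label among its neighbours and itself) can be sketched as follows. This toy version applies asynchronous sequential updates on a small graph and is not the parallel semi-synchronous implementation studied in the paper.

# Illustrative sketch: sequential, asynchronous label propagation. Each vertex
# takes the most frequent label among its neighbours and itself; updates are
# applied in place, so later vertices in the sweep see the new labels.
from collections import Counter

def label_propagation(adj, max_iters=100):
    """adj: dict mapping vertex -> list of neighbour vertices."""
    labels = {v: v for v in adj}                 # every vertex starts in its own community
    for _ in range(max_iters):
        changed = False
        for v, neighbours in adj.items():
            counts = Counter(labels[u] for u in neighbours)
            counts[labels[v]] += 1               # include the vertex's own label
            best = counts.most_common(1)[0][0]
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break
    return labels

graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(label_propagation(graph))                  # two tightly-knit triangles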


Asymmetry in Athletic Groin Pain Patients and Elite Rugby Union Players using Analysis of Characterising Phases

Shane Gore1,2,3, Chris Richter1,2,3, Brendan Marshall1,2,3, Andrew Franklyn-Miller1, Eanna Falvey1, Kieran Moran2,3

1Sports Medicine Department, Sports Surgery Clinic, Dublin, Ireland 2School of Health and Human Performance, Dublin City University, Dublin, Ireland 3Insight Centre for Data Analytics, Dublin City University, Dublin, Ireland

[email protected]

Abstract
This study compared inter limb asymmetry between field sports players with athletic groin pain (AGP) and rugby union (RU) players. Analysis of characterising phases was utilised to identify significant differences in asymmetry between the two groups. The AGP group displayed significantly greater inter limb asymmetries in hip ab/adduction moments compared with the RU group (p<0.01), suggesting that an aspect of rehabilitation for AGP should focus on reducing asymmetric hip moments.

1. Introduction
AGP is a common injury in sports involving repetitive twisting, kicking and turning movements such as occur in field sports (Quinn, 2010). In soccer, for example, AGP accounts for 8-18% of all injuries (Hölmich, 1998). Inter limb asymmetry (ILA) may be of relevance in the investigation of athletic groin pain (AGP) as it is a suggested risk factor for other lower extremity injuries. To date, no studies have investigated if this is the case. Traditionally within biomechanics discrete data points are chosen for statistical analysis. As noted by Richter et al. (2013), however, this can result in discarding more than 98% of the data. The aim of this study is to compare ILA in rugby union (RU) players to that of AGP patients using a continuous data approach. It was hypothesised that the AGP group would demonstrate significantly greater ILA.

2. Methodology
15 field sports players with AGP (age, 25.6±5.3 years; height, 181.4±6.5 cm; mass, 82.7±11.8 kg; time with AGP, 50.8±46.2 weeks) and 15 elite injury free RU players were recruited (age 20.4±1.0 years; height 186.2±7.6 cm; mass 98.4±9.9 kg). Testing involved three trials on each leg for a running cut (75°). Eight infra-red cameras (Vicon Bonita B10, UK), synchronized with two force platforms (AMTI, BP400600, USA), collected data using Plug in Gait marker locations (Vicon, UK). Marker and force data were filtered using a 4th order Butterworth filter at 15 Hz. Hip, pelvis angles and hip moment waveforms were manually landmark registered (using dynamical time warping) or phase shift registered (Ramsay, 2006). Subsequently, normalized asymmetry (Asym) was calculated at each time point (t) [equation not reproduced in this extraction]. The registered mean curves were shifted by the minimum value within both curves (δ), making values positive. An independent t-test (α=0.05) examined subject scores generated by an analysis of characterising phases (Richter et al., 2013). Key phases were identified using a VARIMAX rotation principal component.

3. Results/Discussion
The primary finding was that the AGP group displayed significantly greater ILA in hip ab/adduction moments compared with the RU group (p<0.01); see Table 1. Whilst it is unknown if these ILAs have a causative relationship with AGP, ILAs are generally considered undesirable in the sports injury literature. As such, an aspect of rehabilitation for AGP should focus on reducing asymmetric hip ab/adduction moments.

Table 1: Phases of significant asymmetries between the AGP and RU group.
Variable                   % key phase   P value   Asymmetry summary
Hip ab-adduction moment    2-5           0.001     RU > AGP
                           8-11          0.011     AGP > RU
                           17-32         0.010     AGP > RU
                           38-100        0.000     AGP > RU
Hip int-ext moment         3-9           0.004     RU > AGP

4. References
[1] Quinn, A. "Hip and groin pain: physiotherapy and rehabilitation issues." Open Sports Medicine Journal 4.1 (2010): 93-107.
[2] Ramsay, J. O. Functional Data Analysis. John Wiley & Sons, Inc., 2006.
[3] Richter, C., N. E. O'Connor, B. Marshall, and K. Moran. "Analysis of Characterizing Phases on Waveforms: An Application to Vertical Jumps." Journal of Applied Biomechanics (2013).


Automatic Classification of Knee Osteoarthritis Images with Reference to Kellegren-Lawrence Radiographic Scores Joseph Antony, Kevin McGuinness, Noel E. O'Connor and Kieran Moran Insight Centre for Data Analytics Dublin City University [email protected]

Abstract
In this paper we propose a method to automatically classify knee osteoarthritis radiographic images with reference to Kellegren-Lawrence (KL) scores provided by the Osteoarthritis Initiative (OAI). KL radiographic grading scores (0-4) are commonly used to define the presence of and to estimate the severity of osteoarthritis. After pre-processing the images and segmenting the knee-joint region, several features based on raw pixel data, textures, edge structures, statistical distribution of pixel values, polynomial decomposition of the images and transforms of the images were extracted. The extracted features are used to train the classifier.

1. Introduction
Osteoarthritis (OA), also known as degenerative joint disease, is a group of mechanical abnormalities involving degradation of joints, including articular cartilage and sub-chondral bone [3]. Although knee osteoarthritis is a major public health issue causing chronic disability, there is no objective or accurate method for measurement of its severity in general clinical practice [2]. Here we are proposing a method to automatically classify knee osteoarthritis images with reference to the Kellegren-Lawrence scores (0-4), where the extremities '0' corresponds to 'no OA' and '4' corresponds to 'severe OA' [1].

2. Experimental Data
The dataset required for the experiments, a baseline data sample of 200 progression and incidence cohort subjects under the knee osteoarthritis study, was obtained from the Osteoarthritis Initiative, University of California, San Francisco.

3. Features
The MRI images in the DICOM format were converted to png format and the knee-joint region in each image was segmented. After segmenting the region of interest, the following features were extracted:
- Tamura texture features of contrast, coarseness and directionality.
- Haralick texture features computed using the image's co-occurrence matrix.
- Gabor textures, using a Gaussian harmonic kernel and different frequencies.
- Multi-scale histograms of the pixel intensities using 3, 5, 7 and 9 bins.
- First four moments (mean, standard deviation, skewness, kurtosis) of the pixel intensities computed in directions (0°, 45°, 90°, 135°) and values sampled into a 3-bin histogram.
- Radon transforms computed in different directions (0°, 45°, 90°, 135°) and convolved into a 3-bin histogram.
- Chebyshev statistics, edge and object statistics.
- Histogram of Oriented Gradients.
In order to extract more image descriptors, the algorithms are applied not only on the raw pixels, but also on several transforms of the image such as the FFT, Wavelet (Symlet 5, level 1) two-dimensional decomposition of the image and the Chebyshev transform. Among the extracted features, not all are assumed to be equally informative. In order to select the most informative features, each image feature can be assigned a simple Fisher score or Pearson correlation score. After each image feature is assigned a weight, the most informative 15% of the features are selected for the analysis.

4. Classification
A multiclass SVM classifier, or multiple SVM classifiers to classify amongst successive grades, can be used to classify the feature vectors. The classification can also be approached as a regression problem in this case. This is still on-going work.

5. Acknowledgement
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under grant number SFI/12/RC/2289.

6. References
[1] Lior Shamir, David T. Felson, Luigi Ferrucci, Ilya G. Goldberg, "Assessment of Osteoarthritis Initiative Kellgren and Lawrence scoring projects quality using computer analysis", Journal of Musculoskeletal Research, Vol. 13, No. 4, pages 197-201, 2010.
[2] H. Oka, S. Muraki, T. Akune, A. Mabuchi, T. Suzuki, H. Yoshida, S. Yamamoto, K. Nakamura, N. Yoshimura, H. Kawaguchi, "Fully automatic quantification of knee osteoarthritis severity on plain radiographs", Journal of Osteoarthritis and Cartilage, 16, pages 1300-1306, 2008.
[3] Hee-Jin Park, Sam Soo Kim, So-Yeon Lee, Noh-Hyuck Park, Ji-Yeon Park, Yoon-Jung Choi, Hyun-Jun Jeon, "A practical MRI grading system for osteoarthritis of the knee: association with Kellgren-Lawrence radiographic scores", European Journal of Radiology, pages 112-117, 2013.
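As an illustration of two of the feature families listed in Section 3, the sketch below computes multi-scale intensity histograms and the first four statistical moments for a segmented region. The stand-in image data and parameter choices are assumptions; this is not the authors' feature-extraction pipeline.

# Illustrative sketch: multi-scale intensity histograms and first four moments
# of pixel intensities for a segmented knee-joint region (random stand-in data).
import numpy as np
from scipy.stats import skew, kurtosis

def multiscale_histograms(region, bin_counts=(3, 5, 7, 9)):
    """Concatenated normalised intensity histograms at several bin resolutions."""
    feats = []
    for bins in bin_counts:
        hist, _ = np.histogram(region, bins=bins, range=(0, 255), density=True)
        feats.extend(hist)
    return np.array(feats)

def intensity_moments(region):
    """Mean, standard deviation, skewness and kurtosis of pixel intensities."""
    flat = region.ravel().astype(float)
    return np.array([flat.mean(), flat.std(), skew(flat), kurtosis(flat)])

region = np.random.randint(0, 256, size=(128, 128))   # stand-in for a segmented knee joint
feature_vector = np.concatenate([multiscale_histograms(region), intensity_moments(region)])
print(feature_vector.shape)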


Automatic Complex Event Implementation System Feng Gao, Edward Curry Insight Centre for Data Analytics E-mail: fi[email protected]

Abstract
The proliferation of sensor devices and services, along with the advances in event processing, brings many new opportunities as well as challenges. As an example, Smart City applications promise to promote the urban life of citizens. However, it is difficult to discover, compose and implement the event services or service compositions for such applications. In this paper, we address this issue by providing an automatic complex event implementation system.

1. Motivation
Recent developments in Sensor Networks and Information Communication Technologies are enablers for the Internet-of-Things (IoT). Great opportunities are arising for rendering IoT services in Smart City applications, which promise to promote urban performances, but we are not there yet.

2. Problem Statement
Smart City applications envision to analyze and react upon realtime, complex events about physical or social environments. However, facilitating these scenarios requires dealing with heterogeneous data sources and formats, processing complicated event logics, and addressing users' customized requirements. We propose an Automatic Complex Event Implementation System (ACEIS) to efficiently discover, compose and implement complex events using services for Smart City applications.

3. Research Question
In the ACEIS implementation, we strive to answer the following research questions. How to describe complex event services (CES) with event patterns? How to efficiently create CES compositions based on functional requirements (event patterns) as well as non-functional requirements (Quality-of-Service)?

4. Hypothesis
CES can be made first-class citizens by extending the OWL-S ontology with event patterns. CES compositions can be created efficiently using a reusability index based on event patterns. The composition algorithm can be extended with a Genetic Algorithm (GA) to efficiently fulfill users' QoS requirements.

5. Proposed Solution
Figure 1 shows the architecture of ACEIS. It contains mainly a resource management module, which consumes event requests and produces composition plans, and a data federation module, which consumes composition plans and produces stream queries deployed on stream engines.

Figure 1: ACEIS overview (ACEIS core with resource management, event service composer and stream data federation modules, built on resource virtualisation and semantic annotation of data streams).

6. Evaluation
We implemented a prototype of ACEIS and evaluated it over simulated datasets. The experiment results indicate that, over 1000 random event patterns with average size over 25 nodes, the composition algorithm based on the reusability index takes about 40% of the time required by the un-indexed approach [1], and the GA based algorithm can create up to 97% optimal service compositions in much less time, compared to a brute-force enumeration.

7. Related Work
E-Cube [2] also analyzes the reusability between (sequential) event patterns and organizes them into a hierarchy. A recent work in [3] uses Generalized Component Services to perform GA over service tasks on different granularity levels. However it only caters for IOPE web services.

References
[1] F. Gao, E. Curry, and S. Bhiri. Complex Event Service Provision and Composition based on Event Pattern Matchmaking. In Proceedings of the 8th ACM International Conference on Distributed Event-Based Systems, 2014.
[2] M. Liu, et al. E-Cube: Multi-dimensional event sequence analysis using hierarchical pattern query sharing. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, 2011.
[3] Q. Wu, et al. QoS-aware multi-granularity service composition based on generalized component services. In Service-Oriented Computing, volume 8274 of Lecture Notes in Computer Science, pages 446-455, 2013.
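To illustrate the kind of QoS-driven genetic search mentioned in Sections 4 and 6, the sketch below evolves a composition that picks one candidate service per required sub-pattern under a weighted latency/accuracy utility. The QoS model, service data and weights are invented for illustration and are not part of ACEIS.

# Illustrative sketch: a minimal genetic algorithm for QoS-aware selection of
# one candidate service per sub-pattern of a composite event request.
import random

candidates = {                      # sub-pattern -> [(service id, latency ms, accuracy), ...]
    "traffic": [("t1", 120, 0.90), ("t2", 60, 0.80)],
    "weather": [("w1", 200, 0.95), ("w2", 90, 0.85)],
    "parking": [("p1", 50, 0.70), ("p2", 150, 0.99)],
}
patterns = list(candidates)

def fitness(plan):
    latency = sum(candidates[p][i][1] for p, i in zip(patterns, plan))
    accuracy = min(candidates[p][i][2] for p, i in zip(patterns, plan))
    return accuracy - 0.001 * latency          # trade accuracy against total latency

def evolve(pop_size=20, generations=50, mutation=0.1):
    pop = [[random.randrange(len(candidates[p])) for p in patterns] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(patterns))
            child = a[:cut] + b[cut:]           # one-point crossover
            if random.random() < mutation:      # point mutation of one gene
                g = random.randrange(len(patterns))
                child[g] = random.randrange(len(candidates[patterns[g]]))
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print([candidates[p][i][0] for p, i in zip(patterns, best)])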

Bayesian Inference for Exponential Random Graph Models Using Composite Likelihoods

Lampros Bouranis (1,2,*), Nial Friel (1,2), Florian Maire (1,2)
1 Insight Centre for Data Analytics
2 School of Mathematical Sciences, University College Dublin, Ireland
* [email protected]

Abstract
Exponential Random Graph Models play an important role in network analysis since they allow complex correlation patterns between the nodes of the network, hence making themselves more realistic than usual network models. Despite their popularity, ERGMs are complicated to work with as they fall into the framework of doubly intractable models, as addressed in [1]. An alternative approach in the context of block conditional composite likelihoods to infer ERGMs is discussed and comparisons are made with standard approaches in terms of performance.

1. Motivation
Exponential random graph models (ERGMs) are a widely used class of models with applications to social network analysis. Such models are typically ruled by a set of unknown parameters that one is willing to estimate. Despite their popularity, ERGMs are extremely difficult to handle from a statistical viewpoint, because of the existence of intractable likelihoods and normalizing constants. This research investigates appropriate methodology for dealing with such issues.

2. Problem Statement
An early approach for overcoming the problem of intractability has been to approximate the objective function by the pseudolikelihood method [3]. Maximum pseudolikelihood estimation (MPLE) is computationally fast but leads to inaccurate or unreliable estimates of the model parameters. This approximation relies on a strong (and often unrealistic) assumption: the full conditional independence between the nodes of the network. Another way to perform approximate maximum likelihood estimation is to cast the inference into a Bayesian framework and to infer the posterior distribution of the parameter given the network. Although Maximum A Posteriori estimation can provide similar results as MLE, it raises issues such as computational complexity. Therefore, inaccuracy of parameter estimation and increasing computational time need to be dealt with.

3. Related Work
The theory and application of composite likelihood methods has received a lot of attention from statisticians. However, the Bayesian estimation approach for ERGMs with the use of block composite likelihoods has not yet been fully explored. The solution that is based on the exchange algorithm and is proposed in [1] gives rise to new issues. Our work follows up on [2], where superior performance of Maximum block-pseudolikelihood estimation (MBPLE) was observed in the context of hidden binary Markov random fields.

4. Research Question
The main interest of this research is twofold. First, to examine the performance of the model parameter estimates under the conditional composite likelihood approximation. Second, to understand the statistical efficiency of the proposed approach, with respect to the way the conditional likelihood is defined.
Previous asymptotic analyses have shown that maximum composite likelihood estimation (MCLE) is statistically more efficient than maximum pseudolikelihood estimation (MPLE) [2]. A better approximation is expected to lead to increased computational complexity. Consequently, the trade-offs between using a larger block size (k) with a smaller number of blocks (B) have been explored.

5. Proposed Solution
Analysis is performed by implementing a Metropolis-within-Gibbs algorithm targeting the conditional composite likelihood approximation. The main contribution of this research lies in: (i) investigation of the quality of the parameter estimates under different approximations and different networks; (ii) comparison of MCLE to MPLE and the method proposed in [1].
Different approaches have been used for the formation of the blocks, namely: (i) non-overlapping blocks formed by nodes sorted by decreasing order of degree; (ii) non-overlapping blocks formed by appropriately reordered nodes to minimise the adjacency matrix bandwidth (after applying the Reverse Cuthill-McKee Algorithm); and (iii) overlapping blocks that have been investigated through the use of the aforementioned criteria.

6. Discussion
The relationship between block composite likelihood and standard approaches for ERG model estimation has been examined. However, there still remain some open issues regarding the efficient optimization of composite likelihoods. Optimally choosing the shapes and sizes of the blocks to use in the composite likelihood is still an open area of research.

7. References
[1] I. Murray, Z. Ghahramani, and C. MacKay. MCMC for doubly-intractable distributions. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06), pages 359-366, 2006.
[2] N. Friel, N. Pettitt, R. Reeves, and E. Wit. Bayesian inference in hidden Markov random fields for binary data defined on large lattices. JCGS, 18:243-261, 2009.
[3] D. Strauss and M. Ikeda. Pseudolikelihood estimation for social networks. JASA, 85:204-212, 1990.
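A minimal sketch of a Metropolis-within-Gibbs sampler of the type referred to in Section 5, assuming a user-supplied unnormalised log target standing in for the block conditional composite likelihood plus prior. The toy Gaussian target, step size and iteration counts are placeholder assumptions, not the paper's ERGM model.

# Illustrative sketch: component-wise random-walk Metropolis (Metropolis-within-Gibbs)
# targeting an unnormalised log posterior supplied by the user.
import numpy as np

def log_target(theta):
    # Stand-in for log composite likelihood + log prior of the model parameters.
    return -0.5 * np.sum((theta - np.array([1.0, -2.0])) ** 2)

def metropolis_within_gibbs(theta0, n_iters=5000, step=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    theta = np.array(theta0, dtype=float)
    samples = np.empty((n_iters, theta.size))
    current = log_target(theta)
    for it in range(n_iters):
        for j in range(theta.size):                      # update one component at a time
            proposal = theta.copy()
            proposal[j] += step * rng.normal()
            cand = log_target(proposal)
            if np.log(rng.uniform()) < cand - current:   # accept/reject step
                theta, current = proposal, cand
        samples[it] = theta
    return samples

draws = metropolis_within_gibbs([0.0, 0.0])
print(draws[1000:].mean(axis=0))                         # posterior mean estimate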


Bayesian Low-rank Matrix Completion: General Sampling Distribution and Optimal Rate The Tien MAI, Pierre ALQUIER School of Mathematical Sciences, UCD, Ireland [email protected] ; [email protected]

Abstract
The problem of low-rank matrix estimation recently received significant attention due to challenging applications. A lot of work, both on the theoretical and computational aspects, has been done on penalized minimization methods. However, just a few papers considered Bayesian estimation, and the behaviour of Bayesian algorithms has not been studied yet. In this paper, we propose an approach for Bayesian Matrix Completion which leads to the optimal rate of RMSE (root-mean-square-error).

1. Introduction
Recently, the problem of recovering an uncompleted matrix from a small sample of its entries, called matrix completion, has received much attention due to the emergence of some challenging applications; e.g. recommender systems (especially the Netflix Prize).
Let $M^0_{m \times p}$ be an unknown matrix (expected to be low-rank) and $(X_1, Y_1), \dots, (X_n, Y_n)$ be i.i.d. (independent and identically distributed) random variables (r.vs) such that
$$Y_i = M^0_{X_i} + \mathcal{E}_i, \quad i = 1, \dots, n. \qquad (1)$$
The noise variables $\mathcal{E}_i$ are independent with $\mathbb{E}(\mathcal{E}_i) = 0$. The variables $X_i$ are i.i.d. copies of a r.v. $X$ having distribution $\Pi$ on the set $\mathcal{X} = \{1, \dots, m\} \times \{1, \dots, p\}$. Then, the problem of estimating $M^0$ with $n \le mp$ is called the noisy matrix completion problem with general sampling distribution.
The methods used in this model are usually based on minimization methods penalized by the rank or the nuclear norm of the matrix. A first underlying result was given by Candès and Tao [1] and efficient algorithms are proposed for example in [4].
Various Bayesian estimators for matrix completion have also been considered which rely on Gibbs Sampling [5] or Variational Bayes [3]. However, theoretical properties of Bayesian estimators have not been examined yet.

2. Main Result
Assume that $\sup_{i,j} |M^0_{ij}| \le L < +\infty$ and that the noise variables are sub-exponential (bounded and Gaussian as special cases). Let $K = \min(m, p)$. With $M = U_{m \times K}(V_{p \times K})^T$, we propose a prior on the matrix $M_{m \times p}$ as follows:
$$U_{i,\ell},\ V_{j,\ell} \sim \begin{cases} \mathcal{U}\!\left(\left[-\tfrac{2L}{K}, \tfrac{2L}{K}\right]\right) & \text{when } \Gamma_{k,\ell} = 1,\\[4pt] \mathcal{U}([-\kappa, \kappa]) & \text{when } \Gamma_{k,\ell} = 0,\end{cases}$$
with $0 \le \kappa \le \frac{1}{n}\frac{L}{10K}$, where $\Gamma$ is a r.v. such that $P(\Gamma = \Gamma_k) = \tau^{-(k-1)}/\text{norm.const.}$, with $\Gamma_k := (1, \dots, 1, 0, \dots, 0)$ ($k$ ones followed by $K-k$ zeros), for $k = 1, \dots, K$ and some constant $\tau \in (0, 1)$.
We define the estimator
$$\widehat{M} = \frac{\int M e^{-\lambda r(M)}\, d\pi}{\int e^{-\lambda r(M)}\, d\pi}$$
and proved an oracle inequality for this estimator with general sampling distribution. This inequality shows that, whatever the rank of the matrix to be estimated, our estimator reaches the minimax-optimal rate [2] (up to a log-term), that is
$$\frac{(m + p)\,\mathrm{rank}(M^0) \log(K)}{n}.$$

3. Experiments
Only 20% of the entries of the rank-2 matrix $M^0 = U^0_{m \times 2}(V^0_{m \times 2})^T$ are observed, with $U^0_{i,\ell}, V^0_{j,\ell} \sim \mathcal{N}(0, 20/\sqrt{m})$. Then this sample is corrupted by noise which is i.i.d. $\mathcal{N}(0, 1)$. We use the above estimator to reconstruct $M^0$ via a Metropolis-Hastings kernel. The results are evaluated by measuring
$$\mathrm{RMSE} = \sqrt{\frac{1}{mp}\,\|\widehat{M} - M^0\|_F^2}$$
and comparing with the Gibbs Sampler algorithm.

Table 1: RMSE for K = 5
                 m = 100          m = 200
our estimator    0.59 (±0.027)    0.38 (±0.012)
Gibbs Sampler    0.57 (±0.003)    0.36 (±0.001)

References
[1] E. J. Candès and T. Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Inform. Theory, 56(5):2053-2080, 2010.
[2] V. Koltchinskii, K. Lounici, and A. B. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302-2329, 2011.
[3] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, volume 7, pages 15-21, 2007.
[4] B. Recht and C. Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201-226, 2013.
[5] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, pages 880-887. ACM, 2008.
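The following sketch mirrors the experimental protocol of Section 3 (a rank-2 matrix, 20% of entries observed, standard Gaussian noise) and the RMSE criterion of Table 1, but substitutes a simple truncated-SVD reconstruction for the authors' Metropolis-Hastings estimator, so the numbers it produces are not comparable to Table 1.

# Illustrative sketch: simulate the partially observed rank-2 matrix and score a
# stand-in reconstruction with the RMSE used in the experiments above.
import numpy as np

rng = np.random.default_rng(0)
m = 100
U0 = rng.normal(0, 20 / np.sqrt(m), size=(m, 2))
V0 = rng.normal(0, 20 / np.sqrt(m), size=(m, 2))
M0 = U0 @ V0.T                                       # true rank-2 matrix

mask = rng.random((m, m)) < 0.20                     # 20% of entries observed
Y = np.where(mask, M0 + rng.normal(size=(m, m)), 0)  # observed noisy entries, zero elsewhere

# Stand-in estimator: rank-2 truncated SVD of the rescaled zero-filled matrix.
U, s, Vt = np.linalg.svd(Y / 0.20, full_matrices=False)
M_hat = (U[:, :2] * s[:2]) @ Vt[:2]

rmse = np.sqrt(np.mean((M_hat - M0) ** 2))           # sqrt(||M_hat - M0||_F^2 / (m*p))
print(rmse)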


A Bid Profiling Tool for Facebook Advertising Campaigns Ewa Młynarska, Derek Greene, Pádraig Cunningham Insight Centre for Data Analytics E-mail: [email protected]

Abstract adverts, which are influencing the performance, and then At a time of high popularity of social media networks, ad- uses this model to predict the ongoing performance of a vertising on networks like Facebook has started to become campaign. In contrast to previous papers, our work tries an increasingly popular choice for many brands. The ba- to identify most profitable periods during the week (month) sic aim is to reach as wide an audience as possible for a in terms of bid costs. minimum cost, thereby increasing the return on investment (ROI) for advertisers. To help manage online campaigns, 4. Research Question companies like Flashpoint have developed highly special- The main aim of the current project is to predict the suc- ized tools and services. Working with Flashpoint, we have cess of the ad campaigns in the future. Our tool will need to created tools focused on optimizing ROI for advertisers and provide a profile of the Facebook bid landscape over time, maximizing Facebook advertising campaign effectiveness. providing recommendations about the changes in the cam- paign that should be issued. One of the main questions is 1. Motivation how can we recommend the changes based on dynamic and high volume Facebook data. The key goal for this work is to predict times during an advertising campaign at which certain demographic groups should be targeted to ensure that the campaign is more suc- 5. Hypothesis cessful. Good performance here means a significantly posi- The assumption is that bid fluctuations may be cyclical tive deviation between the market value and the actual value (e.g. daily, weekly, monthly cycles). These changes might paid by an advertiser (e.g. the amount paid in terms of Cost be explained by external factors (e.g. competing campaigns per click (CPC), versus the recommended bid suggested by that overlap with the same audience, major events), or in- Facebook). This would help campaign managers at an ad- herent behaviours of certain demographic groups. vertising agency to decide when to start or stop particular campaigns, or to target certain demographics in order to 6. Proposed Solution minimize cost and maximize audience reach. The idea for the implementation is to develop a web- based tool for Flashpoint to identify key time periods during a campaign at which performance was particularly success- 2. Problem Statement ful. That would allow for the application of similar strate- To advertise on Facebook, the campaign manager needs gies during future campaigns. The ”meta-level” tool will to choose the maximum value to bid for a click on each allow for cross-campaign comparison and building a profile Facebook ”ad group”, which is aimed at a specific demo- for specific types of campaigns, which share common char- graphic, referred to as a ”targeting spec”. The actual final acteristics in terms of their audience. The tool is developed cost per click depends on how the automated Facebook bid- in Java with use of JavaScript D3 for graph creation. The ding system will calculate the value of the appearance of input to the tool is bid data collected from the Facebook the ad. Competition for the same demographic groups and Marketing API, which is supplied by Flashpoint in JSON external factors will affect the cost. It is important that the format. 
The Key Performance Indicator (KPI) chosen for campaign manager specifies the maximum bid, so as not to the purpose of this project is CPC Median, which provides miss out on clicks, as well as a time range which will give the estimated median bid price (cost per click) for reaching best performance. the target audience. The data may either correspond to (a) an on-going campaign, or (b) a historic campaign that has 3. Related Work already ended and will be used to compare Flashpoint per- Previous studies in the field have tried to identify which formance data (e.g. actual CPC values) with market data factors make users click on the adds[1] and measure from the same time period (e.g. bid estimate values). whether the posting day may have influence on user inter- action campaign’s success. This was based on the com- References ments, likes, interaction duration or reach[2]. This work [1] P. P. S. McCoy, A. Everard and D. F. Galletta. The effects of is the continuation of the Bid Recommender System De- online advertising. Communications of the ACM, 50(3):84– veloped at Clique UCD for Flashpoint. The task there was 88, 2007. [2] G. A. Soonius. Facebook strategies: How to measure cam- to build an automated tool that assesses the performance of paign success. Master’s thesis, 2012. the adverts, builds a statistical model for components of the
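As an illustration of the hour-of-week bid profiling described in Sections 5 and 6, the following Python sketch computes the median estimated CPC per (weekday, hour) slot. It is not the project's actual tool (which is implemented in Java with D3): the JSON field names "timestamp" and "cpc_median" and the file layout are assumptions for the example.

# Minimal sketch: profile cyclical bid behaviour from a JSON export of bid estimates.
import json
from collections import defaultdict
from datetime import datetime
from statistics import median

def hour_of_week_profile(path):
    """Return {(weekday, hour): median CPC} for one targeting spec's bid data."""
    buckets = defaultdict(list)
    with open(path) as f:
        for record in json.load(f):                      # assumed: a list of bid records
            ts = datetime.fromisoformat(record["timestamp"])
            buckets[(ts.weekday(), ts.hour)].append(float(record["cpc_median"]))
    return {slot: median(values) for slot, values in buckets.items()}

if __name__ == "__main__":
    profile = hour_of_week_profile("bid_estimates.json")  # hypothetical input file
    # Cheapest slots first: candidate periods in which to concentrate campaign spend.
    for (day, hour), cpc in sorted(profile.items(), key=lambda kv: kv[1])[:5]:
        print(f"weekday={day} hour={hour:02d} median CPC={cpc:.3f}")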


Building a User Profile from Automatic Human Activity Recognition in Lifelogging, for a Personal Recommender System
Stefan Terziyski, Rami Albatal, Cathal Gurrin
Dublin City University
{stefan.terziyski, rami.albatal, cathal.gurrin}@insight-centre.org

Abstract
Our research focuses on Human Activity Recognition (HAR) in Lifelogging. The aim is to develop an automatic classification model for human activities from multi-modal lifelog data – images, accelerometer and GPS data. The availability of one's lifelog provides ground for building a rich user profile and hence a basis for an advanced personal recommender system.

1. Motivation
Recent technological development in personal and ubiquitous computing has provided the conditions for end-user lifelogging. Recommendation systems rely on user profiles in order to make recommendations to end-users, but constructing a user profile remains a challenge. Lifelogs gathered through personal computing can be a solution to this problem, as they are a valuable source of information for building detailed and rich user profiles.

2. Problem Statement
The main challenge is Human Activity Recognition (HAR). Whereas this is a well-formulated problem in surveillance and monitoring, it is not well defined for visual signals that originate from the user's point of view, i.e. the view that the user sees. This can also be interpreted as visual context. So instead of looking at what a person does and trying to identify the activity, the goal is to recognize the activity of that person given his/her surrounding environment. Due to the unavailability of video streams in lifelogging, spatio-temporal information is lost. Furthermore, a user may be performing one activity while having the same view as when performing another.

3. Research Questions
The research questions are: How can similar views or sceneries be detected? How can the different sensors be fused together to provide a reliable activity classification?

4. Related Work
Several works in the state of the art use personal sensing data to recognize activities. Among them, in [1] the authors have used a combination of global image descriptors (color histogram and edge orientation histogram) with FFT accelerometer features, 2D GPS data and a subset of MPEG-7 audio descriptors; good results are obtained from the use of an SVM-HMM classifier. Another successful approach, presented in [2], relies solely on visual information. However, this framework may not be applicable in practice, since pre-trained SVM object classifiers are used and, due to the nature of lifelog data, it is unknown which objects have to be detected per user. A solution for that is providing classifier training on the fly, but this cannot be done automatically.

5. Hypothesis
For HAR, visual data, accelerometer data and geo-location data should be sufficient to determine the activity which the user is performing at the given sensory readings. Once a classification model is established, it can provide automatic annotation of the user's data and hence provide a user profile with the user's activity per day, with start and end time per activity window. That in itself constitutes a good platform for recommendations.

6. Proposed Solution
Global image descriptors, e.g. colour histograms, can be analysed with discriminative models, such as Support Vector Machines, to filter out the images and provide the most appropriate candidate activities. The second stage is to extract local image descriptors, e.g. Histograms of Oriented Gradients, and employ them with generative models, such as a Hidden Markov Model, to produce the final activity classification. The second stage is where other sensory data may be fused, particularly accelerometer data.

7. Discussion
Our approach is to have an extensive multi-modal data collection across 5 users for at least a month each. The idea is then to provide a user-independent classification model, which should be verified over the collected data.

Acknowledgements
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under grant number SFI/12/RC/2289.

8. References
[1] J. Hamm, B. Stone, M. Belkin, S. Dennis. "Automatic Annotation of Daily Activity from Smartphone-based Multisensory Streams". Fourth International Conference on Mobile Computing, Applications and Services, 2012.
[2] P. Wang, A. F. Smeaton. "Using Visual Lifelogs to Automatically Characterize Everyday Activities". Information Sciences, Elsevier, 10 Jan 2013.
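A minimal sketch of the first stage of the pipeline proposed in Section 6 is given below: global colour-histogram descriptors fed to an SVM that proposes candidate activities for each lifelog image. The file paths and label names are placeholders, and the second (HMM) stage and sensor fusion are not shown.

# Stage 1 sketch: global colour histograms + SVM candidate filtering.
import numpy as np
from PIL import Image
from sklearn.svm import SVC

def colour_histogram(path, bins=8):
    """Global descriptor: a joint RGB histogram, L1-normalised."""
    rgb = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    hist, _ = np.histogramdd(rgb, bins=(bins, bins, bins), range=[(0, 256)] * 3)
    hist = hist.ravel()
    return hist / hist.sum()

def train_candidate_filter(image_paths, activity_labels):
    """SVM over global descriptors, returning class probabilities at test time."""
    X = np.vstack([colour_histogram(p) for p in image_paths])
    return SVC(kernel="rbf", probability=True).fit(X, activity_labels)

# Usage sketch (paths/labels are hypothetical):
# clf = train_candidate_filter(paths, labels)
# probs = clf.predict_proba([colour_histogram("new_image.jpg")])[0]
# candidates = [clf.classes_[i] for i in np.argsort(probs)[::-1][:3]]  # top-3 for the HMM stage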


Cardiorespiratory Fitness and Vascular Function in Youth
Sinead E. Sheridan and Niall M. Moyna
The Insight Centre for Data Analytics, Dublin City University, Dublin
[email protected]

Abstract
Low fit (LF) and high fit (HF) healthy adolescent boys had a blood sample taken and their body composition, carotid intima media thickness (cIMT), endothelial function (EF) and V̇O2max measured. Body mass index, % body fat, blood pressure, fasting glucose, triglycerides, non-high density lipoprotein cholesterol and cIMT were significantly higher in LF than HF. EF was lower (p<0.05) in LF than HF. There was a significant relation between V̇O2max and EF and a significant inverse relation between V̇O2max and cIMT.

Introduction
Cardiovascular disease (CVD) refers to disease of the heart and blood vessels and is the leading cause of mortality in Ireland. CVD begins in childhood and adolescence, due primarily to exposure to lifestyle-mediated risk factors. Endothelial dysfunction induced by CVD risk factors is one of the earliest events in the development of CVD and precedes structural changes in the artery wall such as intima media thickness. Exercise training has been shown to restore EF, decrease cIMT and improve CV risk in obese children. There is currently no research that has examined subclinical atherosclerosis in asymptomatic adolescents with low and high cardiorespiratory fitness (CRF). This study compared CVD risk factors, cIMT and EF in adolescent boys with low and high CRF, and examined the relation between CRF and cIMT and between CRF and FMD in asymptomatic adolescent boys. It was hypothesized that boys with a high CRF would have a healthier CV profile and that there would be an inverse relation between CRF and cIMT and between CRF and EF.

Methods
A total of 9 low fit (15.89±0.60 yr) and 14 high fit (15.86±0.37 yr) participated in the study.

Selected CVD Risk Factors
Fasting blood samples were obtained using standard venipuncture. Resting systolic BP and diastolic BP were recorded. Skinfold calipers were used to measure double-thickness subcutaneous adipose tissue.

Cardiorespiratory Fitness (V̇O2max)
A ramp treadmill protocol with open circuit spirometry was used to determine V̇O2max.

Echo-Doppler Imaging
cIMT was assessed using a 12.0 MHz linear-array transducer. Measurements were conducted in the 10-mm linear segment of the near and far walls of the common carotid artery and averaged. Brachial artery FMD was measured as previously described (1).

Results
The table summarizes the physiological, blood pressure, physical fitness, serum lipid, glucose and cIMT results (values are mean±SD).

                           Low Fit       High Fit      P value
Height (cm)                180.4±5.0     174.1±0.4     0.006
Mass (kg)                  90.0±16.3     62.9±5.1      0.001
BMI (kg.m-2)               27.7±5.0      20.7±1.5      0.003
Body fat (%)               26.8±8.9      7.9±2.1       0.000
SBP (mm Hg)                136.4±8.8     111.9±6.1     0.000
DBP (mm Hg)                82.4±5.5      75.1±5.0      0.003
V̇O2max (ml.kg-1.min-1)     40.5±1.77     64.0±3.9      0.000
Triglycerides (mg.dl-1)    117.9±54.3    51.7±20.4     0.006
Cholesterol (mg.dl-1)      117.9±54.3    139.8±33.2    0.268
LDL-C (mg.dl-1)            99.6±21.2     82.3±24.2     0.094
HDL-C (mg.dl-1)            39.1±6.7      46.6±12.6     0.116
Non-HDL (mg.dl-1)          115.5±22.3    93.3±26.5     0.050
Glucose (mmol.L-1)         4.5±0.4       4.2±0.3       0.048
Right CCA cIMT (cm)        0.06±0.01     0.04±0.00     0.000
Left CCA cIMT (cm)         0.05±0.01     0.04±0.00     0.001

Endothelial dependent dilation (EDD) was lower (p<0.05) in LF than HF (Fig 1). There was a positive relation between EDD and V̇O2max (Fig 2). Right and left cIMT were significantly higher in LF than HF. There was a significant relation between V̇O2max and near (r = 0.65; p<0.001) and far (r = -0.77; p<0.001) right wall cIMT, and between V̇O2max and near (r = -0.73; p<0.001) and far (r = -0.74; p<0.001) left wall cIMT.

Figure 1: % change in Endothelial Dependent Dilation and Endothelial Independent Dilation in the low and high CRF groups.
Figure 2: Relation between EDD and V̇O2max (r = 0.86, p<0.001).

References
[1] Corretti, MC et al. J Am Coll Cardiol. 2002 Jan 16;39(2):257–65.


The Classification and Evaluation of Exercise Technique using Inertial Sensors and Depth Cameras
Martin O'Reilly; Darragh Whelan; Tomás Ward; Brian Caulfield
Insight Centre for Data Analytics, University College Dublin
[email protected]

Abstract
Resistance training is an important component of athletic performance and general health. However, poor training technique during exercise may lead to reduced training effectiveness and an increased injury risk. Recent developments in Inertial Measurement Units (IMUs) and Red Green Blue Depth (RGBD) cameras may allow for individualised feedback systems to prevent deviations from correct exercise technique. This innovative work aims to use low-cost sensor technologies in order to analyse human movement and develop feedback systems to prevent breakdown from optimal exercise form in the gym.

1. Motivation
Resistance training can result in improvements in athletic performance and general health. However, poor technique during resistance exercises can result in biomechanical inefficiencies, leading to reduced training effectiveness and increased risk of injury. These mistakes are minimised by employing trained professionals to aid with technique evaluation. However, limited resources mean it is not always possible to train under the supervision of an expert trainer. To date, people undertaking resistance training programmes alone have had to rely on self-observation and memory to adhere to correct exercise technique. These athletes may make critical errors whilst training. These mistakes may hinder performance progress and heighten the risk of injury. Recent developments in IMU and RGBD camera technologies have the potential to address these issues and enhance the implementation of resistance exercise training.

2. Related Work
A number of studies have analysed the use of inertial sensors and RGBD cameras in exercise training. This includes work on exercise recognition [1], biofeedback [2], and rehabilitation systems [3]. This study focuses primarily on the evaluation of movement patterns and utilises multiple sensor technologies in order to identify and evaluate the benefits and capabilities of a variety of sensor sets.

3. Problem Statement
Cost and availability issues mean expert guidance is not available to all who wish to participate in resistance training. This commonly results in poor exercise technique. The consequences of this are the development of injuries and the inability to reach specific strength-related goals efficiently.

4. Hypothesis
It is expected that both the inertial and RGBD sensor technologies will show capability in the analysis of gym-based exercise. In order to develop a practical and cost-effective solution for gym-goers, it is desirable to utilise techniques that maximise the amount of meaningful data which can be extracted from a minimal sensor set.

5. Proposed Solution
Initially, a suitability study involving 22 weight-trained participants has been conducted to ascertain whether IMUs and RGBD cameras can be used to classify exercises and detect deviations from correct technique. The 5 body-worn IMUs were positioned at anatomical markers on the participant's lumbar region of the spine, left shank, right shank, left thigh and right thigh. Acceleration and angular velocity were recorded at 51.2 Hz from each of the IMUs during all exercise performance conditions. Maximum and minimum acceleration, angular velocity, pitch and roll (in X, Y, Z) and the range of each were calculated for each condition. A paired t-test was used to analyse whether there was a difference in the parameters between the different exercise conditions. This process has highlighted many statistically significant changes in the data for various exercise conditions.

6. Future Work
A variety of signal processing and machine learning techniques will be implemented and compared to maximise the value of a minimal sensor set which can classify and evaluate exercise technique. This analysis will build on work already completed within Insight by Gowing et al. [1] and Giggins et al. [3]. This will allow for the development of low-cost biofeedback systems for gym-goers which can provide valuable and motivating feedback to users, similar to that which a professional strength and conditioning coach would provide.

References
[1] M. Gowing, A. Ahmadi, and F. Destelle, "Kinect vs. Low-cost Inertial Sensing for Gesture Recognition," Multimed. …, pp. 1–12, 2014.
[2] O. M. Giggins, U. M. Persson, and B. Caulfield, "Biofeedback in rehabilitation," J. Neuroeng. Rehabil., vol. 10, p. 60, 2013.
[3] O. Giggins, D. Kelly, and B. Caulfield, "Evaluating Rehabilitation Exercise Performance Using a Single Inertial Measurement Unit," in 7th International Conference on Pervasive Computing Technologies for Healthcare and Workshops, 2013, pp. 49–56.
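The per-condition feature extraction and paired comparison described in Section 5 could be sketched roughly as follows. This is an illustrative outline rather than the study's actual analysis code; it assumes one NumPy array of shape (n_samples, 3) of accelerometer data per participant per condition, and covers only the acceleration features.

# Sketch: max/min/range features per axis, then paired t-tests across conditions.
import numpy as np
from scipy import stats

def condition_features(acc_xyz):
    """Max, min and range of acceleration on each axis for one exercise set."""
    acc_xyz = np.asarray(acc_xyz)
    feats = []
    for axis in range(acc_xyz.shape[1]):
        signal = acc_xyz[:, axis]
        feats += [signal.max(), signal.min(), signal.max() - signal.min()]
    return np.array(feats)

def compare_conditions(correct_sets, deviant_sets):
    """Paired t-test per feature across participants (correct vs. deviant form)."""
    a = np.vstack([condition_features(s) for s in correct_sets])
    b = np.vstack([condition_features(s) for s in deviant_sets])
    return [stats.ttest_rel(a[:, j], b[:, j]) for j in range(a.shape[1])]

# results = compare_conditions(correct_sets, deviant_sets)   # hypothetical inputs
# Features with p < 0.05 are candidates for distinguishing the two techniques.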

This project is funded by the Irish Research Council as part of a Postgraduate Enterprise Partnership Scheme with Shimmer (EPSPG/2013/574).


Considering Uncertainty in Data – A Practical Example from the Forestry Industry
Adejuyigbe O. Fajemisin
University College Cork
E-mail: [email protected]

Abstract
Uncertainties such as a lack of knowledge of the available stock make forestry problems stochastic. A scenario-based stochastic model with a service-level constraint was formulated in order to optimize the harvesting of a forest and satisfy customer demand.

1. Motivation
Real-world problems have associated unknown parameters, and this lack of knowledge is known in general terms as "uncertainty". This work focuses on the uncertainty in the forest capacity due to factors such as weather, pests, and measurement errors, amongst others.

2. Related Work
Stochastic programming, dynamic programming, optimal control theory, chance-constrained programming, as well as fuzzy set theory have all been used to address the problem of uncertainty in forestry [1]. Robust Optimization has been used to analyse the problem of harvest scheduling, and to schedule production in a sawmill [2].

3. Problem Statement
The goal is to optimize both the harvesting of the forests, as well as the associated costs, in order to satisfy customer demand despite the uncertainty in forest capacity. A stochastic programming approach will be taken in order to solve this problem. Scenario-based stochastic problems can be very large and take long to solve. Consider a forestry company with 7 forests and 4 stands in each forest, with each stand producing 3 different types of wood products. There are therefore 7 × 4 × 3 = 84 quantities. If there are 3 harvest scenarios (good, bad and average yields) then there are 3^84 ≈ 1.197 × 10^40 possible scenarios.

4. Solution
Solving a problem with 1.197 × 10^40 possible scenarios is unrealistic, therefore a type of Monte Carlo technique – Sample Average Approximation (SAA) – was used to solve the problem in order to see if it would produce a good approximate solution. A service-level constraint was also implemented in order to provide some robustness to the solution.

5. Evaluation
In order to simulate the effect of a 5% uncertainty level in the forest capacity, for each variable x in the forest capacity array, a normal distribution was generated with the mean at x and the standard deviation at 0.05x. A random variable x̂ was then sampled from this distribution for use in the calculations. The SAA method was afterwards used to generate a much smaller number of representative scenarios (from 100 to 1000 scenarios) and the problem was solved.

6. Results
The results were compared with the case in which the forest capacity is known completely, i.e. the deterministic case. Using this approach, the problem size was made more manageable, and the inclusion of the service-level constraint in the model ensured a certain level of robustness of the obtained solutions. In terms of the effect of the uncertainty in forest capacity on the forestry company, it was seen that this uncertainty resulted in a 6 to 7% increase in the costs incurred by the company. This is shown in Figure 1.

Figure 1: Cost of uncertainty to the company.

In conclusion, this approach allows the company to easily consider the effect of the uncertainty on the costs incurred, and helps them to build some robustness into their supply of products to customers, in order to satisfy as many customers as possible.

References
[1] A. S. Kangas and J. Kangas. Probability, possibility and evidence: approaches to consider risk and uncertainty in forestry decision analysis. Forest Policy and Economics, 6(2):169–188, 2004.
[2] M. Varas, S. Maturana, R. Pascual, I. Vargas, and J. Vera. Scheduling production for a sawmill: A robust optimization approach. International Journal of Production Economics, 150(0):37–51, 2014.
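A minimal sketch of the scenario sampling used in the evaluation (Section 5) is given below, under the stated assumption of a 5% uncertainty level: each capacity x is perturbed by a Normal(x, 0.05x) draw, and a modest number of sampled scenarios stands in for the full 3^84 scenario space. The downstream stochastic program itself is not shown, and the nominal capacity values are dummies.

# Sample Average Approximation input: perturbed forest-capacity scenarios.
import numpy as np

rng = np.random.default_rng(seed=42)

def sample_scenarios(capacity, n_scenarios=500, uncertainty=0.05):
    """Draw n_scenarios capacity arrays, each entry ~ Normal(x, uncertainty*x)."""
    capacity = np.asarray(capacity, dtype=float)
    noise = rng.normal(loc=capacity, scale=uncertainty * capacity,
                       size=(n_scenarios,) + capacity.shape)
    return np.clip(noise, 0.0, None)  # capacities cannot be negative

# Example with the 7 forests x 4 stands x 3 products structure (dummy nominal values):
nominal = np.full((7, 4, 3), 100.0)
scenarios = sample_scenarios(nominal, n_scenarios=500)
print(scenarios.shape)  # (500, 7, 4, 3) capacity scenarios to feed the SAA model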


Data Analytics For Assessing Financial Incentives
Yulia Malitskaia and Barry O'Sullivan
University College Cork
E-mail: {yulia.malitskaia, barry.osullivan}@insight-centre.org

Abstract
The ePolicy Project decision support system represents a global multi-objective optimization framework that takes into account several categories of explicitly-defined requirements and the simulation-based effects of financial incentives. This paper aims to enhance this approach by developing an efficient analytical approach. In the context of the ePolicy Project, this approach can act as a rational method for benchmarking the simulation models and can be directly incorporated into the global optimization solver.

1. Motivation
With carbon dioxide emissions rapidly increasing on a global scale, these levels pose a grave concern as a potential precursor to a continued and uncontrolled rise of temperatures worldwide. Therefore, understanding what drives photo-voltaic (PV) installations, and how much the government should allocate towards financial incentives, is essential for establishing renewable energy sources (RES) as the dominant source of energy and helping to prevent climate change.

2. Problem Statement
The European Commission funded ePolicy Project's (Grant Agreement 288147) decision support system was created within the context of developing financial incentives for promoting PV installations within the Italian region. One of the most important inputs of the system – the financial incentives – is provided by the agent-based simulator [1]. This project develops an efficient data analytics approach to benchmark the simulation models and be directly incorporated into the global optimisation solver.

3. Related Work
Papers to date have begun laying down the foundation for tackling this task by developing new models based on econometric tools. Jenner et al., for instance, proposed a new metric for analysing the financial performance of feed-in tariffs [4]. Creti and Joaug showed the association between the empirical regression model and a discrete choice approach based on the utility function [2]. Furthermore, Favuzza and Zizzo conducted a descriptive comparison among several metrics in the context of Italian energy plans [3].

4. Research Questions
How to capture the full PV panel deployment pattern? To validate a model's performance it is essential to apply the approach on a dataset that captures the complete pattern: growth, plateau, drop, and slight recovery. Meanwhile, previous works have only focused on the growth.

How to identify significant features for describing the heterogeneity within the model? To accurately calculate the unique parameter term and overcome the omitted variable bias, it is necessary to identify important region-specific features.

5. Hypothesis and Proposed Solution
We hypothesise that an analytical model is essential for determining the effectiveness of feed-in tariffs within Italy. To implement this task we propose to develop a hybrid approach based on econometric and data mining techniques for quantifying financial incentives and identifying important features that influence PV installations.

6. Evaluation
The developed piecewise linear regression model successfully captured the effectiveness of feed-in tariffs on the deployment of PV installations in Italy from 2008 to 2013. The model is then further validated on a region-based dataset using the t-test, panel data models, and the Chow test. The results confirm the applicability of the model to other regions and that there is heterogeneity in the parameter terms among the regions. To identify the important features causing this effect, two data mining techniques were employed for clustering similar regions together: self-organizing maps (SOM) and k-means. Both data mining approaches calculated similar results and identified a few interesting clusters.

References
[1] A. Borghesi, M. Milano, M. Gavanelli, and T. Woods. Simulation of Incentive Mechanisms for Renewable Energy Policies. Technical report, ePolicy Project, 2013.
[2] A. Creti and J. Joaug. Let the sun shine: optimal deployment of photovoltaics in Germany. HAL Working Paper Series, 2013.
[3] S. Favuzza and G. Zizzo. The new course of FITs mechanism for PV systems in Italy: novelties, strong points and criticalities. Policy Issues, 2011.
[4] S. Jenner, F. Groba, and J. Indvik. Assessing the strength and effectiveness of renewable electricity feed-in tariffs in European Union countries. Energy Policy, 2013.
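The region-clustering step in Section 6 could be illustrated with the sketch below, using one of the two techniques mentioned (k-means). The feature names and numeric values are placeholders, not the project's actual region-specific variables or data.

# Sketch: cluster Italian regions by hypothetical explanatory features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

regions = ["Lombardia", "Puglia", "Sicilia", "Veneto"]
X = np.array([
    # [solar irradiance index, GDP per capita (k EUR), installed PV capacity (MW)]
    [1450.0, 36.0, 2100.0],   # dummy illustrative numbers
    [1750.0, 18.0, 2600.0],
    [1800.0, 17.0, 1300.0],
    [1500.0, 31.0, 1700.0],
])

# Standardise so that no single feature dominates the distance metric.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))
for region, label in zip(regions, labels):
    print(f"{region}: cluster {label}")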


A Data Driven Approach to Determining the Influence of Fatigue on Movement Variability and Associated Injury Risk in Chronic Ankle Instability
Alexandria Remus, Eamonn Delahunt, Kevin Sweeney, Brian Caulfield
University College Dublin
[email protected]

Abstract
There is a high prevalence of chronic ankle instability (CAI) following an initial sprain, and a low understanding of its onset. The use of 3D inertial sensors may provide a better understanding of the implications of fatigue for inappropriate movement patterns during running gait in those with CAI, and their relation to its onset.

1. Motivation
Lateral ankle sprains are one of the most common injuries suffered by athletes in sports. It is estimated that upwards of 70% of athletes will develop chronic ankle instability following an initial ankle sprain. Despite the high prevalence of CAI, knowledge of the mechanism or prevention of repeated ankle sprains is limited.

2. Problem Statement
Previous studies have identified common gait alterations in individuals with a history of ankle sprain which are hypothesised to contribute to CAI. However, the extent to which these inappropriate movement patterns are influenced by the onset of fatigue has yet to be determined. Since most sprains occur in the latter halves of matches, the purpose of this study was to determine the effects of fatigue on lower limb movement variability in individuals with and without CAI during running gait.

3. Related Work
When looking at variables associated with anterior cruciate ligament (ACL) injury of the knee, the ability of a person to perform smooth and controlled movements was found to be limited under the effects of fatigue [1]. This loss of control and coordination increases the likelihood of ACL injury. Quantification of gyroscope features from inertial sensors has been shown to successfully identify changes in ACL-reconstructed gait that otherwise would have gone unnoticed with traditional temporal and spatial gait measures [2].

4. Research Question
Does fatigue influence lower limb movement variability in those with chronic ankle instability during running gait?

5. Hypothesis
We hypothesised that, with the onset of fatigue, participants with CAI would be characterised by a reduced level of movement variability while running compared to those without CAI. This decreased variability may be an indication of an inability to adapt to external stresses from the environment, increasing the susceptibility to sustaining repeated sprains.

6. Proposed Solution
The recent development of wireless sensors allowed us to combine a standard sports science method with a data-driven, analytical approach to solve a clinical problem that presents itself in a real-life sporting situation. Our novel approach utilised six 3D SHIMMER® inertial sensors attached to each thigh, shank, and foot to collect lower limb movement profiles of participants with and without CAI as they became fatigued throughout the course of the Yo-Yo Intermittent Recovery Test level 1 (Yo-Yo IR1) running protocol.

7. Discussion
The utilisation of the Yo-Yo IR1 as the running protocol will allow us to compare lower limb movement profiles both prior to and throughout a fatigued state. The Yo-Yo IR1 consists of two sequential 20 m runs at increasing speeds, each separated by a turn. Events occurring around the time surrounding the turn are our main focus, as the turn best simulates a time in which an athlete is susceptible to suffering a sprain. The use of six SHIMMER® sensors provided lower limb movement profiles of individuals as they underwent the running protocol. Simultaneous recording with a high-speed camera has allowed us to locate the turning point of the run in the data. Future analysis will aim to see if certain characteristics in these movement patterns present themselves at key moments during a turn, as well as how these characteristics change under the influence of fatigue. The characteristics of each turn will be compared within individuals throughout the duration of the Yo-Yo IR1, as well as between those with and without CAI.

Figure 1: Locating initial contact and push off of a turn from total acceleration data.

References
[1] N. Cortes et al. Differential effects of fatigue on movement variability. Gait & Posture, 39(3):888–93, March 2014.
[2] M. R. Patterson et al. An ambulatory method of identifying anterior cruciate ligament reconstructed gait patterns. Sensors, 14:887–899, January 2014.
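A small illustrative sketch of the kind of processing shown in Figure 1 follows: computing total acceleration from a tri-axial sensor and picking out prominent peaks around a turn as candidate initial-contact / push-off events. The sampling rate, units and threshold values are assumptions, and this is not the study's analysis code.

# Sketch: candidate foot-contact events from total acceleration.
import numpy as np
from scipy.signal import find_peaks

def total_acceleration(acc_xyz):
    """Magnitude of the tri-axial accelerometer signal, shape (n_samples,)."""
    return np.linalg.norm(np.asarray(acc_xyz), axis=1)

def candidate_foot_events(acc_xyz, fs=102.4, min_height=2.0, min_gap_s=0.25):
    """Indices of prominent peaks in total acceleration (units of g assumed)."""
    magnitude = total_acceleration(acc_xyz)
    peaks, _ = find_peaks(magnitude, height=min_height,
                          distance=int(min_gap_s * fs))
    return peaks, magnitude

# peaks, mag = candidate_foot_events(shank_acc)   # shank_acc: hypothetical (n, 3) array
# print(peaks / 102.4)  # event times in seconds, to be cross-checked against the video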


DWARF Compression and NoSQL: The Future of XML Analytics
Michael Scriney & Mark Roantree
Dublin City University
E-mail: {michael.scriney, mark.roantree}@insight-centre.org

Abstract
OLAP cubes are widely used as a means of analysing data. An OLAP cube is a series of cross-tab queries which contains all the information for a set of queries in a data warehouse. However, data cubes are based on information obtained from a relational database. XML is a widely used markup language used to store a wide variety of information, from documents to web-data. Such cubes have problems with construction times, storage and querying. DWARF is a cube compression method which boasts a 1:400000 storage reduction ratio under certain conditions [4]. Combining this storage reduction with a fast, scalable NoSQL database such as Cassandra [3] would create a fast XML analytics system.

1. Motivation
Static OLAP cubes are the main tool used in Decision Support Systems (DSS) [1]. However, their long construction times and resulting size pose many problems. In addition, the focus of an OLAP cube has primarily been on relational data provided by a Relational Database Management System (RDBMS). It is possible to convert XML data into a relational format to analyse and create a cube, however this would only further increase the cube construction time and, in addition, lose the rich metadata provided by the structure of the XML document.

2. Problem Statement
Ideally a DSS would be able to instantly update and make recommendations based on XML data and have a low storage overhead; however, the nature of OLAP cubes makes this impossible. The initial conversion process from XML to relational data would add to the construction time and lose the metadata associated with the document's structure. Cube creation time prevents new information from being known to the DSS immediately, and their size makes storing and accessing the valuable data inside the cube a slow process. This makes the process of analysing high-velocity web-data very difficult.

3. Related Work
Numerous works have been presented on optimizing OLAP cubes, e.g. [1] discusses using parallelization to create a real-time OLAP system. However, this approach only dealt with an in-memory system and did not approach the topic of storing the resulting OLAP cubes. [2] discusses creating OLAP cubes from XML data, however the underlying problems associated with OLAP cubes are not addressed.

4. Research Question
How can XML be analysed by first creating an XML cube using DWARF compression, and how can the resulting cube be stored and queried at a later stage?

5. Hypothesis
Creating an XML cube using DWARF compression [4] would greatly reduce the overhead as opposed to a traditional OLAP cube. These cubes can then be stored in a scalable NoSQL environment such as Cassandra [3], which would provide the ability to store XML cubes in a fast, scalable system with a low storage overhead and improved query times.

6. Proposed Solution
The proposed solution is to create a system which creates DWARF cubes from XML data. These cubes would then be stored in Cassandra [3]. A query engine will then be created to analyse the XML cubes stored in Cassandra. This would then provide the ability to perform high-level data mining operations (k-means, classification, etc.) on XML cubes.

7. Evaluation
The system's performance will be evaluated by creating XML OLAP cubes. Their construction times will then be compared to an XML cube created using DWARF compression. Their storage overhead will be recorded. Their query/access times will then be compared by performing high-level data-mining operations on each cube.

References
[1] F. Dehne and H. Zaboli. Parallel real-time OLAP on multi-core processors. In Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), pages 588–594. IEEE Computer Society, 2012.
[2] M. R. Jensen, T. H. Møller, and T. B. Pedersen. Specifying OLAP cubes on XML data. Journal of Intelligent Information Systems, 17(2-3):255–280, 2001.
[3] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010.
[4] Y. Sismanis, A. Deligiannakis, N. Roussopoulos, and Y. Kotidis. Dwarf: Shrinking the petacube. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 464–475. ACM, 2002.
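As a toy illustration of the kind of cross-tab cells an XML cube holds (it does not implement DWARF compression or Cassandra storage), the sketch below aggregates a measure over every combination of two dimensions taken directly from XML records. The element and attribute names are invented for the example.

# Toy "cube cell" construction from XML: every group-by of the chosen dimensions.
import itertools
import xml.etree.ElementTree as ET
from collections import defaultdict

XML = """<sales>
  <sale region="IE" product="books" amount="120"/>
  <sale region="IE" product="music" amount="80"/>
  <sale region="UK" product="books" amount="200"/>
</sales>"""

def cube_cells(root, dims=("region", "product"), measure="amount"):
    """Sum the measure for every subset of dimensions ('*' marks a rolled-up dimension)."""
    cells = defaultdict(float)
    for sale in root.iter("sale"):
        values = tuple(sale.get(d) for d in dims)
        for mask in itertools.product((True, False), repeat=len(dims)):
            key = tuple(v if keep else "*" for v, keep in zip(values, mask))
            cells[key] += float(sale.get(measure))
    return cells

for key, total in sorted(cube_cells(ET.fromstring(XML)).items()):
    print(key, total)   # e.g. ('IE', '*') 200.0 -- cell for region=IE over all products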


Dynamic User Authentication Based on Mouse Movement Curves
Zaher Hinbarji, Rami Albatal and Cathal Gurrin
Dublin City University, Dublin
{zaher.hinbarji, rami.albatal, cathal.gurrin}@insight-centre.org

Abstract
In this research we introduce a new behavioural biometric approach based on mouse movements only, using normal mouse devices. We focus on the properties of the curves generated from the consecutive mouse positions during typical mouse movements. Experimental results show that normal user interaction with the computer via mouse devices entails behavioural information with discriminating features, which can be exploited for user authentication.

1. Motivation
A common restriction of most biometric systems is the need for special devices for data capture. Mouse dynamics is an exception, since it can be implemented using a regular mouse. In contrast to static authentication, in which user identity is checked once, dynamic verification checks the user continuously over the session, which can effectively prevent session hijacking. These two features make a reliable mouse-based authentication system highly valuable.

2. Research Overview
Our main challenge is to extract features that can reflect the user's behaviour patterns regardless of the task that she/he is performing. Our underlying hypothesis is that mouse curves carry such user-specific information. In our research, we are exploring to what extent the generated mouse curves can discriminate users.

3. Related Work
Several mouse dynamics approaches have been proposed in the literature, presenting different types of features. The approach in [1] used the distance, angle and speed between pairs of data points as raw features, which were then used to produce their mean, standard deviation and third moment values over a window of N data points. In [2], the user's mouse signature is generated from different features based on movement speed, movement direction, and travelled distance. In [3], the length, curvature and inflection points of the curve are used as the main features.

4. Mouse Behaviour Modeling
Our objective is to identify users by analysing features extracted from mouse curves. The exact values of those features may vary considerably even for curves of the same user. However, we assume that each feature follows a probability distribution that is unique to each user and can serve as a signature of his/her mouse movements. The probability distribution of each feature is approximated by a normalized histogram computed using curves belonging to a user. The histograms of the different features belonging to a certain user form the signature of that user. Finally, identifying users is done using artificial neural networks. Each user has his/her own neural network classifier, which is trained to recognise his/her signature alone. The final recognized user would be the one corresponding to the neural network with the highest output.
Our approach is similar to the one followed by [3]. However, the two approaches differ in two major aspects. First, we use different features and emphasize using those that satisfy certain mathematical properties related to task-independence. Second, [3] uses a single 'reference' signature per user; the Euclidean distance between the evaluated signature and the reference signature is then used to validate the user. Our method, on the other hand, uses a neural network per user, which is trained using multiple signatures. The neural network classifier has the potential of better recognition by generalizing from different signature training samples.

5. Evaluation
In order to evaluate our approach we conducted an experiment in which 10 participants were involved. We divided our dataset equally into training and testing subsets. Training the classifiers on the training subset and then evaluating them on the testing one yielded an average EER of 5.3%, which is better than similar state-of-the-art methods that achieved 11.2% [3]. However, this is still an early evaluation and further tuning is ongoing.

6. Acknowledgments
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under grant number SFI/12/RC/2289.

7. References
[1] M. Pusara, C. E. Brodley. "User re-authentication via mouse movements". 2004 ACM Workshop on Visualization and Data Mining for Computer Security, pages 1–8, 2004.
[2] A. A. E. Ahmed, I. Traore. "A new biometric technology based on mouse dynamics". IEEE Transactions on Dependable and Secure Computing, 2007.
[3] D. Schulz. "Mouse curve biometrics". Biometric Consortium Conference, 2006 Biometrics Symposium, pages 1–6. IEEE, 2006.
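The signature idea in Section 4 could be sketched, under simplifying assumptions, as follows: each mouse curve is reduced to a hand-picked set of features (length, mean speed, mean turning angle as a crude curvature proxy), a user's signature is the concatenation of normalised histograms of those features over a batch of curves, and one small neural network per user scores a signature as "this user" vs. "others". The bin ranges and feature choices are placeholders, not the paper's actual task-independent features.

# Sketch: histogram signatures from mouse curves + one classifier per user.
import numpy as np
from sklearn.neural_network import MLPClassifier

FEATURE_BINS = {"length": np.linspace(0, 2000, 11),     # assumed pixel units
                "speed": np.linspace(0, 50, 11),
                "curvature": np.linspace(0, 1, 11)}

def curve_features(curve):
    """curve: (n_points, 2) array of consecutive mouse positions."""
    steps = np.diff(np.asarray(curve, dtype=float), axis=0)
    seg = np.linalg.norm(steps, axis=1)
    angles = np.arctan2(steps[:, 1], steps[:, 0])
    turn = np.abs(np.diff(angles))
    return {"length": seg.sum(), "speed": seg.mean(),
            "curvature": turn.mean() if turn.size else 0.0}

def signature(curves):
    """Concatenate normalised per-feature histograms over a batch of curves."""
    feats = [curve_features(c) for c in curves]
    parts = []
    for name, bins in FEATURE_BINS.items():
        hist, _ = np.histogram([f[name] for f in feats], bins=bins)
        parts.append(hist / max(hist.sum(), 1))
    return np.concatenate(parts)

def train_user_model(own_signatures, other_signatures):
    """One classifier per user: his/her signatures vs. everyone else's."""
    X = np.vstack([own_signatures, other_signatures])
    y = np.array([1] * len(own_signatures) + [0] * len(other_signatures))
    return MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000).fit(X, y)

# At verification time, the user whose model assigns the highest probability
# to the observed signature is taken as the recognised identity.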


Energy Efficiency in Smart Homes
Oscar Manzano Torre
University College Cork
[email protected]

Abstract
Using cutting-edge technology in Smart Homes does not offer any guarantee of using energy efficiently. Being energy-aware is important, but getting useful feedback about how energy is consumed at home is the key to becoming energy efficient. In this regard, we are implementing a new approach to providing feedback, aiming to change household behaviour to achieve more efficient use of energy. Based on sensor data, our approach uses an interactive tablet app to report detailed energy consumption, householder behaviour patterns, and tailored advice on energy use.

1. Motivation
Energy inefficiencies in the home might be unknown to households due to a lack of energy-awareness [1]. Energy use is important and reducing energy waste has both local and global benefits. In 2006 Ireland's electricity usage was 20% above the average for the UK, and in 2005, 29% above the average for the EU-27 [2]. The Authentic [3] project pursues the design of a mobile version of an energy management system within the home based on opportunistic decision making. In this project, a set of sensors has been deployed in real homes to measure energy consumption and report households' behavioural patterns. The deployed sensors transmit the readings wirelessly into a database, at intervals ranging from every few seconds to every 5 minutes, all day. What can we do with such an amount of information? What can we build out of it? In brief, household energy habits can be analysed, energy inefficiencies can be detected and useful feedback can be provided. However, households need help in understanding the data and guidance towards better energy use.

2. Problem Statement
The set of sensors deployed creates large amounts of data and it is not obvious how the data can be presented in a way that is useful to the householder. Individual beliefs, behaviours and goals make the problem complex.

3. Related Work
Authentic offers an Android App suggesting tailored actions to households based on preferences and goals. Other varieties of feedback, including ambient canvas art, are explored in ALIS [4]. PowerCost [5] and Kill A Watt EZ [6], among other commercial products, are able to show energy use in real time, although they do not transform the information into something more meaningful and user-friendly that can help their users become more energy efficient.

4. Research Question
Can we find a useful way to present complex data that allows users to understand their energy consumption and to reduce their energy bills?

5. Hypothesis
Presenting data with multiple configurable levels of detail, tailored to individual users, allows them to identify the information that motivates them to change their behaviour. Furthermore, it will improve energy awareness and thus reduce energy inefficiencies. In turn, it will decrease the energy bill for householders.

6. Proposed Solution
Constructed as part of the Authentic architecture and prototype deployment, an App using a multi-touch interface on Android tablets has been developed. The App offers multiple reports categorized by cost, consumption and usage; users can customize them by appliance or room, selecting custom periods of time. The App instantly accesses live data feeds, providing detailed reports to the users regarding their energy usage and their behaviour patterns. It also allows the users to set up their energy goals and track their position across the year. In addition, the App is able to guide users to actions tailored to their preferences and goals, based on a recommender system which is under development and will be plugged in.

7. Evaluation
The sensors and App are being deployed in a group of homes, and the effectiveness of the App is being evaluated. The App integrates a tracking system that feeds into the database which reports are used by the users. In this manner, we can identify which reports are important for them, as well as which information they are sensitive to receiving feedback on. Historical data will allow us to assess the efficacy of this new approach.

8. References
[1] Owen, P. Powering the Nation: Household electricity-using habits revealed. Energy Saving Trust, London, 2012.
[2] Energy in the Residential Sector Report, Sustainable Energy Ireland, 2008.
[3] Autonomic Home Area Network Infrastructure (http://www.ierc.eu/research-projects/home-area-networks/).
[4] Rodgers, Johnny and Bartram, Lyn. ALIS: an interactive ecosystem for sustainable living. ACM, Proceedings of the 12th ACM International Conference Adjunct Papers on Ubiquitous Computing - Adjunct, 2010.
[5] PowerCost Monitor (http://www.bluelineinnovations.com).
[6] Kill A Watt EZ (http://www.p3international.com/products/p4460.html).


Entity Linking with Multiple Knowledge Bases
Bianca Pereira
National University of Ireland, Galway
[email protected]

Abstract
The amount of textual data available in enterprise databases and on the Web is increasing over time. Entity Linking has been used as an approach to extract value from this data by recognizing entities appearing in text. This paper covers the use of multiple knowledge bases for Entity Linking and summarizes our current results.

1. Motivation
Natural Language Understanding is the basis of a series of applications dealing with text. The first step in such understanding is the recognition of entities (People, Locations, Organizations, Music, Movies, etc.). This task has been addressed through the use of Entity Linking (EL), i.e. the linking of mentions in text and Knowledge Base (KB) entries referring to the same real-world entities.

2. Problem Statement
Cross-domain KBs (e.g. Wikipedia) have been the main source of entities for EL approaches due to the broad number of domains they cover. Even so, the use of multiple domain-specific KBs enables the recognition of a higher number of entities. For instance, Wikipedia contains data about 5 million entities, while the Internet Movie Database (http://www.imdb.com) covers more than 8 million entities in the cinema domain alone, and MusicBrainz (http://musicbrainz.org) covers more than 30 million entities in the music domain. Together they have more than seven times the number of entities in Wikipedia. The problem addressed in this work is Entity Linking with Multiple Knowledge Bases.

3. Related Work
Major work in EL is conducted using a single KB. To the best of our knowledge, only [1] and [4] focus on EL using more than one KB. The first assumes there are textual descriptions for entities in the KB in order to perform a TF-IDF ranking. The latter solves the problem using a schema-independent approach and focuses on the use of Relational Databases as KBs. In our work we focus on the use of Linked Data (LD) datasets as KBs and on generic textual and LD features.

4. Research Questions
The problem of EL with multiple KBs leads to the following research questions:
1. What are the generic features (textual and from the KB) most relevant for EL?
2. Is it feasible to have EL using multiple KBs with comparable performance to KB-specific approaches?

5. Hypotheses
The hypotheses related to our research questions are:
H1. Noun phrases and verbs can be directly mapped to entities and relationships in the KB.
H2. Verbs and noun phrases are sufficient to measure the compatibility between a mention and a KB entry.
H3. The division of the KB into context-specific modules enables comparable performance with KB-specific EL.

6. Proposed Solution
The proposed solution is composed of three steps: mention recognition, candidate selection, and disambiguation. The mention recognition step recognizes mentions in text using the KB as a dictionary of entities (H1). The candidate selection step deals with Big Data issues by selecting only a subset of KB entries as candidates for linking with each mention (H3). Finally, the disambiguation step uses textual and KB features to decide which candidate entry is most suitable to be linked with each mention (H2).

7. Results and Future Work
In order to use multiple KBs, our first attempt was the development of an algorithm that works with KBs using generic schemas. For this, we adapted one state-of-the-art algorithm for use with LD datasets. This approach presented f-scores of 0.54 and 0.87 using two different KBs [3]. Next, we used DBpedia as the KB in order to compare with related work in EL, but the same method underperformed all other techniques [2]. This experiment showed that, as future work, an evaluation at feature level is required in order to discover why one method is better than another.

References
[1] G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In Proc. of WWW 2012, 2012.
[2] Eurosentiment. WP4. 2014.
[3] B. Pereira, N. Aggarwal, and P. Buitelaar. AELA: An Adaptive Entity Linking Approach. In Proc. of WWW '13, 2013.
[4] A. Sil, E. Cronin, P. Nie, Y. Yang, A.-M. Popescu, and A. Yates. Linking Named Entities to Any Database. In Proc. of EMNLP-CoNLL 2012, pages 116–127, 2012.
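A simplified sketch of the first two pipeline steps in Section 6 (mention recognition with the KBs used as entity dictionaries, then candidate selection) is shown below. The toy dictionaries are placeholders for real KBs such as DBpedia, IMDb or MusicBrainz, and the disambiguation step is not shown.

# Sketch: dictionary-based mention recognition and per-KB candidate selection.
import re

KNOWLEDGE_BASES = {   # placeholder label -> entity-URI dictionaries
    "dbpedia": {"dublin": ["dbpedia:Dublin", "dbpedia:Dublin,_California"]},
    "musicbrainz": {"dublin": ["mb:artist/dublin-band-placeholder"]},
}

def recognise_mentions(text, kbs=KNOWLEDGE_BASES):
    """Return (surface form, character span) for every KB label found in the text."""
    labels = {label for kb in kbs.values() for label in kb}
    mentions = []
    for label in labels:
        for m in re.finditer(r"\b" + re.escape(label) + r"\b", text.lower()):
            mentions.append((label, m.span()))
    return mentions

def select_candidates(mention_label, kbs=KNOWLEDGE_BASES, per_kb_limit=5):
    """Candidate KB entries per knowledge base, capped to keep the search small."""
    return {name: kb.get(mention_label, [])[:per_kb_limit]
            for name, kb in kbs.items()}

text = "The concert took place in Dublin last summer."
for label, span in recognise_mentions(text):
    print(label, span, select_candidates(label))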


Every Aspect Counts: A Combined Approach towards Aspect Based Sentiment Analysis
Sapna Negi
National University of Ireland, Galway
E-mail: [email protected]

Abstract
We perform fine-grained sentiment analysis by identifying sentiments expressed towards different aspects of a product or service discussed in an opinionated text (e.g. a customer review). We follow a linguistically motivated approach, in combination with the use of sentiment lexicons and machine learning algorithms.

1. Introduction
Opinions expressed on social media are an important source for discovering public sentiment towards entities like persons, places, organizations, products, services, events, etc. Ratings in reviews give an estimate of overall sentiment, but do not reflect the reviewer's opinion towards each aspect of the product or service being reviewed. Therefore, a fine-grained sentiment analysis of the opinionated text is required. The term aspect refers to the features or aspects of a product, service or topic being discussed. Identifying sentiments expressed towards these aspects is referred to as Aspect Based Sentiment Analysis (ABSA) [4]. We categorise sentiments on the basis of their polarity, which could be positive, negative, neutral or conflict.

2. Problem Statement
Positive and negative words in a text are good indicators of the sentiments hidden in the text. However, a text might refer to multiple aspects, with varying sentiments expressed towards them. For example, in the sentence 'Their thin crusted pizza was delicious, but garlic bread was stale', the sentiment towards 'thin crusted pizza' is positive, while it is negative towards 'garlic bread'. Therefore, the problem requires a more advanced solution than merely identifying the positive and negative nature of words in the text.

3. Related Work
Unigrams, bigrams, adjectives and part-of-speech tags are important features for a machine learning based sentiment classifier [3]; in particular, verbs and adjectives are also important [2]. Words which share certain dependency relations with aspect terms tend to indicate the sentiments expressed towards those terms [6].

4. Proposed Solution
We employ a machine learning based classifier which trains on a dataset comprising 3000 sentences each from laptop and restaurant reviews [5]. Training sentences were tagged with the aspect terms and the sentiment polarity towards them. More than one aspect term can be present in a sentence. On the basis of related work and our linguistic analysis, the selected bag-of-words features comprise unigrams in the clause where the aspect term appears, the aspect term itself, and the words which hold the grammatical dependency relations 'nn', 'amod', and 'nsubj' with an aspect term. In addition, we also use two numerical features, which are the sums of the positive and negative polarity scores of adjectives and verbs in the clauses, obtained using SentiWordNet [1]. The pre-processing involved stemming and stop-word removal using our customised stop-word list.

5. Evaluation
We evaluated our method (Table 1) on the gold standard provided in a task on ABSA in SemEval 2014 [5]. The baseline approach was provided by the task organisers. It is a rule-based approach which mainly checks the presence of an aspect term in all the training sentences, and the similarity of training and test sentences.

Table 1: Results on Gold Standard Data
Domain       Baseline   Our System
laptop       51.07      59.15
restaurant   64.28      71.44

References
[1] S. Baccianella, A. Esuli, and F. Sebastiani. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Conference on International Language Resources and Evaluation (LREC'10). European Language Resources Association (ELRA), 2010.
[2] P. Chesley. Using verbs and adjectives to automatically classify blog sentiment. In AAAI-CAAW-06, the Spring Symposia on Computational Approaches, 2006.
[3] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In Conference on Empirical Methods in Natural Language Processing, EMNLP '02, Stroudsburg, PA, USA, 2002.
[4] I. Pavlopoulos. Aspect based sentiment analysis, 2014.
[5] M. Pontiki, D. Galanis, J. Pavlopoulos, H. Papageorgiou, I. Androutsopoulos, and S. Manandhar. SemEval-2014 Task 4: Aspect based sentiment analysis. In International Workshop on Semantic Evaluation, SemEval 2014, 2014.
[6] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Computational Linguistics, pages 399–433, 2009.
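A condensed sketch of the feature extraction in Section 4 is given below, under simplifying assumptions: the "clause" is approximated by splitting on commas and conjunctions, a tiny in-line polarity lexicon stands in for SentiWordNet, and the dependency-relation features (nn, amod, nsubj) are omitted.

# Sketch: clause-level unigram and polarity-sum features for one aspect term.
import re

POLARITY = {"delicious": 0.8, "great": 0.7, "stale": -0.6, "slow": -0.4}  # placeholder lexicon

def clause_with_aspect(sentence, aspect):
    """Return the comma/conjunction-delimited clause containing the aspect term."""
    for clause in re.split(r",|\bbut\b|\band\b", sentence.lower()):
        if aspect.lower() in clause:
            return clause.strip()
    return sentence.lower()

def features(sentence, aspect):
    clause = clause_with_aspect(sentence, aspect)
    tokens = re.findall(r"[a-z']+", clause)
    feats = {f"uni={t}": 1 for t in tokens}        # unigrams in the clause
    feats[f"aspect={aspect.lower()}"] = 1          # the aspect term itself
    scores = [POLARITY.get(t, 0.0) for t in tokens]
    feats["pos_score"] = sum(s for s in scores if s > 0)
    feats["neg_score"] = -sum(s for s in scores if s < 0)
    return feats

s = "Their thin crusted pizza was delicious, but garlic bread was stale"
print(features(s, "thin crusted pizza"))   # positive clause features
print(features(s, "garlic bread"))         # negative clause features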


Ethics of Ambient Assisted Living Technologies for People with Dementia
Peter Novitzky (DCU/Insight/Institute of Ethics), Alan F. Smeaton (DCU/Insight), Cynthia Chen, Kate Irving (DCU/School of Nursing), Tim Jacquemard (DCU/Insight/Institute of Ethics), Fiachra O'Brolcháin (DCU/Institute of Ethics), Dónal O'Mathúna (DCU/School of Nursing), Bert Gordijn (DCU/Institute of Ethics)
E-mail: [email protected]

Abstract
Ambient assisted living (AAL) technologies can provide assistance and support to vulnerable persons, including people with dementia (PwD). They might allow these persons the possibility of living at home for longer whilst still maintaining their comfort, safety and security. AAL technologies also trigger serious ethical issues. This presentation provides an overview of the ongoing scholarly debate about these issues. We address the question of what ethical issues are involved in the various stages of R&D, clinical experimentation, and clinical application of AAL technologies for PwD and other related stakeholders.

1. Motivation
Due to increased life expectancy and falling birth rates, the age distribution in developed countries is gradually shifting towards older populations. This evidence points to an unavoidable, worldwide increase in the age profile of humankind and, therefore, an increase in the prevalence of age-related diseases, including dementia. The ageing population necessitates better and more effective healthcare systems and technologies. AAL technologies show promise in contributing to a solution to this challenge.

2. Problem Statement
Responsible development of AAL technologies demands substantial analysis of the ethical issues that might arise during R&D, clinical trials or clinical practice [1]. During these stages of development, various claims and interests emerge from different stakeholders.

3. Related Work
We conducted an extensive systematic literature review on the ethics of AAL technologies for PwD [3]. Only a few literature reviews are available on this topic [2].

4. Research Question
Our research question is: what are the ethical issues involved in R&D, clinical experimentation, and clinical application of AAL technologies for PwD and related stakeholders?

5. Results
The literature review provides the latest systematic ethical analysis of the challenges and opportunities present in the R&D, clinical trials, and clinical application of AAL technologies, faced by the stakeholders involved: PwD, caregivers (formal and informal), researchers and clinicians, software and hardware engineers, designers, and technicians. Our findings are categorised based on the ethical risks and opportunities involved in: independent living, socialisation, empowerment, safety and security, and care and cost burdens.

6. Evaluation
The goals of AAL have to be carefully assessed in the light of the following issues: a) AAL technologies provide only non-therapeutic treatment, with all its ethical implications; b) lax regulation of testing standards for AAL has been reported [2]; and finally c) the overlapping stakeholders' motivations for R&D are often conflicting. Many articles accept the benefits of AAL without critically questioning this presumption. The justification of AAL for PwD is weaker if it provides benefits only for the caregivers or third parties instead of, primarily, the persons in need of care, and as such violates the principle of proportionality. The special vulnerability of PwD poses challenges for AAL used in home environments, where there is a lack of the 'safety nets' present in healthcare institutions. Errors, malfunctions, and usability issues require very high standards for the safety, security and reliability of AAL technologies. Finally, we propose defining a concept of rolling informed consent during R&D of AAL technologies for PwD. Such an adaptation of the ethical requirement of informed consent should consider the special needs of PwD, while actively engaging them in the R&D of AAL technologies.

7. Acknowledgement
This research has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 288199 Dem@Care.


A Feature-based Analysis for Music Mood and Genre Classification Humberto Jesus´ Corona Pamp´ın and Michael P. O’Mahony School of Informatics and Computer Science, University College Dublin humberto.corona, [email protected]

Abstract
Music classification helps users organise and navigate through the large music collections now available by enabling automatic playlist creation and personalised music recommendations. We address the problem of music classification by exploiting information extracted from lyrics alone, using a large freely-available dataset to compare the performance of classifiers trained on a number of different feature sets. Our experiments show that, over a large collection of songs, moods and genres can be represented in a two-dimensional space of valence and arousal, and that promising classification results can be achieved using features based on this representation compared to a standard bag-of-words based representation.

1. Motivation
Most of the research on music classification is based on audio features [5]. However, previous work by Besson et al. [2] concluded that semantic (lyrics) and harmonic (tunes) information are processed independently by the brain. Thus, lyrics can be used in the music classification task. For example, Hu et al. [4] propose a lyric-based approach to mood classification where different lyrical features, such as stylistic features or features obtained from the ANEW dataset [3], are explored. The results show that a lyrics-based classifier can outperform an audio-based classifier for some mood categories.

2. Problem Statement
We consider the task of music genre and mood classification. We use lyrics as a source of information for classification, evaluating and extending several state-of-the-art approaches. Moreover, we explore the use of features derived from the ANEW dataset, as these map directly to Russell's model of affect [6], which is used in our work to derive mood categories.

3. Proposed Solution
We first analyse the moods and genres of songs and their relationship with the valence and arousal dimensions described in Russell's model to understand how these dimensions can be used to classify songs. Then, we use an instance-based representation of each song using (1) statistical features derived from the valence and arousal dimensions, (2) features based on lyric sentiment derived from a sentiment lexicon and (3) stylistic features. Moreover, we consider an early-fusion ensemble approach in which various features are combined prior to the classification.

4. Evaluation
We conducted our experiments using the Million Song Dataset [1] (a real-world publicly available dataset); lyrics were obtained using the LyricFind API*. We evaluate our proposed solution in a single-label supervised classification approach, measuring the results in terms of true positive and false negative rates. The results in Table 1 show an ensemble of sentiment, stylistic and ANEW-based meta-features compared with the standard vector space model approach with binary term weighting (VSM).

    Algorithm        Moods            Genres
                     TP      FP       TP      FP
    VSM (binary)     0.489   0.172    0.677   0.065
    Stylistic        0.293   0.237    0.373   0.125
    Arousal          0.306   0.233    0.312   0.137
    Valence          0.347   0.219    0.324   0.135
    Dominance        0.296   0.236    0.299   0.140
    Sentiment        0.294   0.236    0.302   0.140
    Ensemble         0.414   0.196    0.483   0.103

    Table 1: Metafeatures accuracy for classification.

5. Future Work
We are currently extending this work to the television domain, analysing the mood of different newscast television programs using features extracted from text transcripts.

References
[1] T. Bertin-Mahieux, D. P. W. Ellis, B. Whitman, and P. Lamere. The million song dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), pages 591–596, 2011.
[2] M. Besson, F. Faita, and I. Peretz. Singing in the brain: Independence of lyrics and tunes. Psychological Science, 9:494–498, 1998.
[3] M. M. Bradley and P. J. Lang. Affective Norms for English Words (ANEW): Instruction manual and affective ratings. 1999.
[4] X. Hu and J. S. Downie. Improving mood classification in music digital libraries by combining lyrics and audio. In Proceedings of the 10th Annual Joint Conference on Digital Libraries (JCDL '10), page 159, 2010.
[5] Y. E. Kim, E. M. Schmidt, R. Migneco, B. G. Morton, P. Richardson, J. Scott, J. A. Speck, and D. Turnbull. Music emotion recognition: A state of the art review. In Proceedings of ISMIR, pages 255–266, 2010.
[6] J. A. Russell. A circumplex model of affect, 1980.

* We gratefully acknowledge the help of LyricFind, who kindly provided full access to their API in order to legally obtain song lyrics.
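As an illustration of the ANEW-based meta-features described in Section 3, the sketch below computes simple valence/arousal statistics for a lyric and feeds them to a classifier. The toy lexicon entries, the example songs and the choice of a random forest are placeholders, not the configuration used in the evaluation above.

```python
# Sketch: ANEW-style valence/arousal meta-features for lyrics (illustrative only).
# `anew` is assumed to map words to (valence, arousal) ratings from the ANEW lexicon [3].
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def anew_features(lyrics, anew):
    """Statistical features over the valence/arousal ratings of the words in a lyric."""
    rated = [anew[w] for w in lyrics.lower().split() if w in anew]
    if not rated:
        return [0.0, 0.0, 0.0, 0.0]
    v, a = zip(*rated)
    return [np.mean(v), np.std(v), np.mean(a), np.std(a)]

# Hypothetical training data: (lyric text, mood label) pairs with invented ratings.
anew = {"happy": (8.2, 6.5), "sad": (1.6, 3.5)}
songs = [("happy happy joy", "positive"), ("sad and lonely", "negative")]
X = np.array([anew_features(text, anew) for text, _ in songs])
y = [label for _, label in songs]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X))
```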


Federated Autonomic Trust Management Samane Abdi Department of Computer Science, University College Cork, Ireland E-mail: [email protected]

Abstract
Sharing resources and information in a secure fashion is a requirement for federation of distributed principals. Proper access control mechanisms are required in order to manage access to those shared resources. To establish a federation, principals must ensure that their resources are safe from inappropriate access while sharing specific resources within the federation. Trust management is an effective approach to manage the trust relationships across the federation and to decide when it is safe to federate. This abstract presents a secure federation framework based on a logic-based language, Subterfuge Safe Authorization Language (SSAL).

1. Motivation
In traditional access control mechanisms an access decision is based on authentication and authorization processes. Such a mechanism is suitable for closed systems, in which the security administrator is familiar with all resources in the system. Therefore, the permissions that the security administrator specifies for accessing resources have a unique interpretation in the system. Consequently, the possibility that the security administrator defines the same permission specification for two different resources is low. However, when the environment becomes decentralized and open, resources are controlled by their owners, and resource owners (principals) decide on their own who is trusted to access their resources. Principals do not have a complete picture of the name schemas used by other principals when defining permission specifications for their resources. Therefore, two different principals may specify the same permission specification for access to their own resources. Consequently, these systems are vulnerable to attacks that deceive a principal and bypass the actual intention of the security mechanism. The lack of an autonomic and systematic security mechanism in the literature motivated us to do this research.

2. Problem Statement
Existing trust management frameworks lack a systematic way of specifying a globally unique interpretation for a permission. Without a globally unique interpretation for permissions, a principal that receives a permission in one domain may misuse that permission in another domain via some deceptive, yet apparently authorized route, so-called subterfuge.

3. Related Work
A variety of trust management systems have been developed over the years to address the requirement for constructing trust and managing authorization among principals. Many existing trust management frameworks such as [2, 1] are designed to specify arbitrary permissions. They assume unique and unambiguous permission names are provided by using a global name provider's services. Although global name providers provide a unique interpretation for each name, the principals participating in a federation may still use arbitrary names to represent their own resources. It depends on the experience of the administrator who defines the permissions to specify non-ambiguous permissions. However, the design of non-ambiguous permissions should not rely on an ad-hoc manner; it should be formalized in a systematic way.

4. Proposed Solution
A logic-based authorization language, SSAL, is designed to support open and subterfuge-free delegation for federations. In other words, SSAL can be used as a policy language to construct statements and manage authorization/delegation relationships to automate the decision making process for securely sharing resources among federated participants. SSAL uses the notion of local permission to eliminate ambiguity concerning the interpretation of a permission and thereby avoid subterfuge attacks. In other words, a principal receiving two identical permission specifications cannot misuse the permissions for non-intended purposes, since the permissions have globally unique interpretations and clearly refer to a global context.

5. Discussions and Conclusions
The notion of a locally defined permission with a globally unique interpretation is an effective approach to avoid ambiguity in permission specifications. Local permissions provide support for the subterfuge safe authorization language, SSAL. SSAL supports decentralized access control without relying on a central authority. A principal may define the permission specifications for its resources locally, and SSAL automatically provides a globally unique interpretation for those permissions.

References
[1] M. Y. Becker, C. Fournet, and A. D. Gordon. SecPAL: Design and semantics of a decentralized authorization language. Journal of Computer Security, 18(4):619–665, 2010.
[2] N. Li and J. Mitchell. RT: A role-based trust-management framework. In DARPA Information Survivability Conference and Exposition, 2003. Proceedings, volume 1, pages 201–212, 2003.
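A minimal sketch of the local-permission idea discussed above: qualifying each permission name with the principal that defined it yields a globally unique interpretation without a central naming authority. The structure and names below are illustrative Python, not SSAL syntax.

```python
# Sketch of "local permission": a permission name is only meaningful together with the
# principal that defined it, so the pair (issuer, local name) acts as a globally unique
# interpretation. Issuer identifiers and permission names here are invented examples.
from collections import namedtuple

LocalPermission = namedtuple("LocalPermission", ["issuer", "name"])

p1 = LocalPermission(issuer="ucc.ie/alice", name="read:reports")
p2 = LocalPermission(issuer="nuig.ie/bob",  name="read:reports")

# Identical local names no longer collide across domains:
assert p1 != p2
```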


Finding Bounded Disjoint Paths in a Very Large Spatial Data for Optical Networks Ata Sasmaz Insight-Centre for Data Analytics, University College of Cork [email protected]

Abstract
In telecommunication networks backup lines are needed to keep users connected during a failure. These backup lines should be disjoint from the actual lines as much as possible. This work is a continuation of the on-going Passive Optical Networks project and addresses the NP-hard problem of where the most cost-effective backup lines should be put.

1. Introduction
In dual-homing long-reach passive optical networks an exchange-site (or central office) is connected to two metro nodes, so in case there is a failure of a complete metro node, all customers are still connected. We are interested in developing an efficient approach to process disjoint paths by chunks, to find bounded disjoint paths between exchange-sites and metro nodes.
Without this approach, discovering disjoint paths with the least cost that visit certain points requires significant memory resources when a country map is taken in for processing. Calculating the cost for a path is another performance problem when the cost function is complex and bound to multiple features.

2. Problem Statement
The problem is to find two disjoint paths from each exchange-site to two of its central offices such that the total distance-based cost of the network is minimized. The spatial data obtained from OpenStreetMap.org is too large to process in memory as a whole. The raw string data for just Ireland is around 1 gigabyte, and it becomes more and more memory-intensive when the graph is generated.
When the data is processed part by part in chunks, there is a trade-off between the number of queries and the size of the output of those queries. The goal is to resolve this trade-off optimally.

3. Related Work
The details of resilient long-reach passive optical networks can be found in [1]. Finding maximal link-disjoint paths in a multigraph is addressed in [2], a quick method for finding shortest pairs of disjoint paths is presented in [3], and link-disjoint paths for reliable QoS routing are studied in [4].

4. Research Question
How much can we minimize the trade-off of data retrieval from a graph data store when processing the spatial data not as a whole but part by part? Also, how can the processing be distributed to different clusters of servers so that memory and CPU requirements are horizontally scalable and the solution can scale for any country, populous or not?

5. Hypothesis
Spatial data for a country has millions of nodes and edges. When pre-processing this data, building vertices in between and associating extra information with the data, memory requirements grow higher and higher. Therefore, processing the data part by part is a better solution than running the job sequentially over the whole dataset, since spatial data can be processed in chunks. This will also incur a data retrieval cost, but it will enable us to distribute the job onto many servers and process it in parallel.

6. Proposed Solution
We are developing a local search algorithm that retrieves a part of the network on demand through a set of queries. An example would be the Neo4j graph database, which provides automated and saved indexes on the edges and properties of nodes, so that generated data structures can be reused the next time we run the algorithm. However, this would certainly have an overhead on the performance of the processing in terms of retrieving that information from the data store.

7. Evaluation
We have generated the actual road network graph of Ireland with all the buildings and central offices associated with it. We are also preparing an approach to the resource requirement problem of discovering least-cost disjoint paths where millions of geographical locations must be held in memory and processed by the disjoint-path-finding search algorithm.

8. References
[1] M. Rufini, D. Mehta, B. O'Sullivan, L. Quesada, L. Doyle, and D. Payne. Deployment strategies for protected long-reach PON. Journal of Optical Communications and Networking, 2012.
[2] J. W. Suurballe and R. E. Tarjan. Networks, 1984. Wiley Online Library.
[3] J. S. Whalen and J. Kenney. Tellabs Inc. Research Center, Mishawaka, IN, USA.
[4] Y. Guo, F. Kuipers, and P. Van Mieghem. Link-disjoint paths for reliable QoS routing, 2003.
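The sketch below illustrates the core sub-task on a toy weighted graph: finding a pair of edge-disjoint paths between an exchange-site and a metro node. The greedy "find, remove, find again" strategy shown here is only illustrative and can be suboptimal or even fail; Suurballe-style algorithms [2, 3] compute the least-cost pair properly. The node names and costs are invented.

```python
# Illustrative sketch: a naive way to get two edge-disjoint paths between an
# exchange-site and a metro node on a weighted road graph. Not the project's method;
# a principled least-cost pair would come from a Suurballe-style algorithm [2, 3].
import networkx as nx

def two_disjoint_paths(graph, source, target, weight="cost"):
    g = graph.copy()
    first = nx.shortest_path(g, source, target, weight=weight)
    g.remove_edges_from(zip(first, first[1:]))   # forbid re-using the first path's edges
    second = nx.shortest_path(g, source, target, weight=weight)
    return first, second

G = nx.Graph()
G.add_weighted_edges_from(
    [("ex1", "a", 1), ("a", "metro1", 1), ("ex1", "b", 2), ("b", "metro1", 2)],
    weight="cost",
)
print(two_disjoint_paths(G, "ex1", "metro1"))
```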


Further Experiments in Sentimental Product Recommendation Ruihai Dong, Michael P. O’Mahony and Barry Smyth University College Dublin {ruihai.dong, michael.omahony, barry.smyth}@insight-centre.org

Abstract
This work compares two content-based approaches to product recommendation: one approach based on features found in reviews (RF), and one seeded by features that are available from meta-data (AF). Both approaches are based on the idea of combining similarity and sentiment to suggest new products, and both derive their product descriptions by mining opinions from user-generated product reviews.

1. Introduction
In the past, recommender systems have largely relied either on the availability of product meta-data (for content-based recommendation) or transactional data (for use in collaborative-filtering style approaches). Recently, the availability of user-generated content (tweets, status updates, reviews, etc.) hints at a new source of recommendation knowledge in the form of the opinions that real users express about products and services that they use. We build on recent work [2] to exploit this form of recommendation knowledge, by using opinion mining techniques to generate product descriptions from the user-generated reviews that are commonplace on sites like Amazon, Yelp, TripAdvisor, etc.

2. Opinionated Recommendation
Recently, user-generated reviews were proposed as a source of product opinions in order to generate feature-based product descriptions for use in a content-based recommender system [2].

2.1. Opinionated Product Descriptions
Unlike traditional content-based recommenders, which describe products using meta-data that is sometimes available, the product descriptions that we are interested in are derived from the opinions expressed in user-generated reviews. In other words, we are interested in those product features that users tend to discuss, whether often or rarely, and that they appear to like or dislike. Thus, for each product P we have a set of features that are discussed in the reviews of P, and for each feature we can compute various properties, including the fraction of reviews it appears in (its popularity) and the degree to which reviews mention it in a positive, neutral, or negative light (its sentiment).
We also consider available meta-data as an original source of relevant features. In the case of TripAdvisor, this meta-data is in the form of an edited set of amenities (for example, spa, 24-hour reception, business centre, etc.) which is associated with each hotel. These amenities can also serve as seed features for sentiment evaluation.

2.2. Generating Recommendations
We assume a more-like-this type of recommendation task [1] in which the user is considering a specific product and is interested in receiving a set of recommendations that are, in some sense, related to that product as alternatives; as such, the product under consideration plays the role of a query Q. To score recommendations we compare a candidate recommendation C and the query product Q using a hybrid score which takes into account their feature similarity and their sentiment assignments. For the purpose of similarity assessment we use a standard cosine similarity metric based on feature popularity. Sentiment assessment compares products on a feature-by-feature basis.

3. Evaluation and Conclusions
We focused on 6 different cities and extracted 148,704 reviews across 1,700 hotels from TripAdvisor. In general, review features tend to produce a higher rating benefit than meta-data features for a similar query similarity. Broadly speaking, rich product descriptions can be produced from user-generated reviews, these product descriptions can be used as the basis for a practical recommendation system, and positive rating benefits are achieved. In comparison to the review-mining approach (RF), the meta-data approach (AF) offers little benefit.

4. Future Work
In our future work, we are interested in building user profiles by examining the features mentioned by a user in their reviews. If we treat the set of reviews written by a user as a document, we can easily compute TF-IDF scores for features in a user collection. In this case, feature TF-IDF scores can reflect how important a feature is for a user, and thus a user's personal interests can be represented by a feature-based vector. These fine-grained user interests might be useful to improve the accuracy of recommender systems. Secondly, we might generate recommendation explanations by means of the important features, which have high TF-IDF scores, and their corresponding opinions.

5. References
[1] R. D. Burke, K. J. Hammond, and B. C. Young. The FindMe approach to assisted browsing. IEEE Expert: Intelligent Systems and Their Applications, 12(4):32–40, 1997.
[2] R. Dong, M. P. O'Mahony, M. Schaal, K. McCarthy, and B. Smyth. Sentimental product recommendation. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, pages 411–414, NY, USA, 2013. ACM.
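The following sketch illustrates the hybrid scoring idea from Section 2.2: cosine similarity over feature popularity combined with a feature-by-feature sentiment comparison. The exact scoring function is defined in [2]; the combination weight, feature names and values below are invented for illustration.

```python
# Sketch of a hybrid "similarity + sentiment" score between a query product Q and a
# candidate C, each described by per-feature popularity and sentiment values.
import numpy as np

def hybrid_score(query, cand, w=0.5):
    feats = sorted(set(query["pop"]) | set(cand["pop"]))
    q = np.array([query["pop"].get(f, 0.0) for f in feats])
    c = np.array([cand["pop"].get(f, 0.0) for f in feats])
    cosine = q.dot(c) / (np.linalg.norm(q) * np.linalg.norm(c) + 1e-12)
    # Feature-by-feature sentiment comparison: fraction of shared features on which
    # the candidate's sentiment is at least as positive as the query's.
    shared = set(query["sent"]) & set(cand["sent"])
    better = np.mean([cand["sent"][f] >= query["sent"][f] for f in shared]) if shared else 0.0
    return w * cosine + (1 - w) * better

Q = {"pop": {"wifi": 0.8, "location": 0.6}, "sent": {"wifi": 0.2, "location": 0.9}}
C = {"pop": {"wifi": 0.7, "spa": 0.4},      "sent": {"wifi": 0.5, "spa": 0.8}}
print(round(hybrid_score(Q, C), 3))
```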


Improving Government Policy & Decision making with Social Data Lukasz Porwol Insight, NUIG [email protected]

Abstract re-production. The legitimacy and significance of The traditional approach to e-Participation as citizens’ contribution to policy making is strengthened technology-mediated, government-controlled dialog directly by government’s acknowledgement, between citizens and the politics sphere does not consideration and subsequent (partial) adoption. consider direct inclusion of popular Social Media (such as well-established platforms - Twitter and Facebook) as first-class communication channel nor as important information feed for political decision support. This gap, defined in literature as Duality-of e-Participation phenomena is in the center of our research work focused on development of specific infrastructure enabling seamless bottom-up incorporation of spontaneous political discussions on Social Media (Social Data) with improved, traditional e-Participation process.

1. Introduction Our work provides a first step towards understanding

of the duality of Government-led and Citizen-led e- Participation and proposes a comprehensive solution Figure 1: Integrated Model for e-Participation integrating those two distinct means of participation. 3. e-Participation Infrastructure Our theoretical framework builds upon Structuration Based on the framework developed, we have elicited Theory complemented by Dynamic Capabilities comprehensive list of e-Participation requirements and Theory. We employ Structuration Theory to understand designed a corresponding infrastructure (Figure 2) [1] how dynamics of power (drawn from relevant rules and to support the Duality of e-Participation. resources) between governments and citizens in deciding what is important for the society and the solutions to adopt could tilt towards the side of citizens through citizen-led deliberations. We also determine what are the key factors deciding of success of re- production and sustainability of the social system. Through the dynamic capabilities theory, we determine additional capabilities required to be obtained by governments in the presence of fast changing public opinion environment to meaningfully exploit and sustain citizen-led e-Participation as a part or a holistic e-Participation framework. Our goal in particular is to develop relevant methodology, a framework and corresponding infrastructure that by supporting the Duality of e-Participation will significantly improve citizen-participation and government transparency Figure 2: e-Participation Infrastructure therefore enable more engaged and constructive citizen- to-decision-makers communication, hence promoting 4. Implementation better political decisions alignment to citizens’ needs. The proposed implementation (work in progress – intended to be used for evaluation with government 2. Integrated Model for e-Participation officials and citizens) leverages state-of-the art Data The Integrated Model for e-Participation (Figure 1) Mining, NLP and Linked Data technologies powered by exploits simultaneously the classic and Social Data- specially created Semantic Model for e-Participation. supported e-Participation to ensure the dynamic distribution of allocative and authoritative resources 5. References between citizens and decision makers in the context of [1] L. Porwol, A. K. Ojo, J. G. Breslin: Harnessing the duality policy-making. Citizens given appropriate resources of e-participation: social software infrastructure design. exercise their agency to participate in the social-system ICEGOV 2013: 289-298


Insight4News: Connecting News to Relevant Social Conversations Bichen Shi, Georgiana Ifrim, Neil Hurley University College Dublin {bichen.shi, georgiana.ifrim, neil.hurley}@insight-centre.org
Abstract We present the Insight4News system1 that connects news articles to social conversations, as echoed in micro-blogs such as Twitter. Insight4News tracks feeds from mainstream media, e.g., BBC, Irish Times, and extracts relevant topics that summarize the tweet activity around each article, recommends relevant hashtags, and presents complementary views and statistics on the tweet activity, related news articles, and the timeline of the story with regard to Twitter reaction. It builds on our award-winning Twitter topic detection approach and several machine learning components, to deliver news in a social context.

Figure 1: High-level overview of the Insight4News system. 1. Introduction Nowadays, more often than not, news stories break on- time-dependent n-gram and cluster ranking and headlines line long before appearing in newspapers. The landscape of re-clustering [1]. For each article’s tweet-bag, we execute news delivery and dissemination has changed dramatically these steps to obtain a set of headlines or topics that sum- in less than a decade since the widespread take-up of social marize the tweet activity relevant to the article. media. While many systems tap on the social knowledge of Hashtag Recommendation. Using the tweet-bag per ar- Twitter to help users stay on top of the information wave, ticle, we form article-hashtag pairs, and compute four fea- none is available for connecting news to relevant Twitter tures for each pair that capture the global (whole stream) content on a large scale, in real time, with high precision and local (article tweet-bag) profile of the hashtag (e.g., and recall. For example, Storyful is a social media news popularity and relevance). We have manually labeled 2,500 agency that tracks and curates Twitter content for breaking article-hashtag pairs to train a Logistic Regression classi- news and potential stories for newsrooms. The headlines fier, which provides a score describing the likelihood that a feature of Twitter provides links to articles that relate to hashtag is relevant to the article. We use this score to rank a specific tweet. Hash2News takes a hashtag as input and hashtags for each article, and recommend the top10 hash- presents relevant news articles for that hashtag. tags with highest classification score. We present Insight4News, which links news articles from mainstream media (e.g., BBC), to relevant Twitter 3. Evaluation conversations, as delivered by tweets, hashtags and auto- Our topic detection approach was assessed by practic- matically detected events and photos [3]. It provides users ing journalists and ranked first as the most effective ”news with a set of headlines summarizing the most important top- miner” with regards to several evaluation criteria, amongst ics discussed over a given time period, and provides social which were precision and recall2. The hashtag recommen- context for news articles via a machine learning algorithm dation classifier gives 87% Precision, 79% Recall, 90% Pre- that classifies and ranks hashtags. cision@1 and 88% NDCG@3 [2].

2. Proposed Insight4News System References The key components of Insight4News are illustrated in [1] G. Ifrim, B. Shi, and I. Brigadir. Event detection in twitter Figure 1. We retrieve articles from news RSS feeds, ex- using aggressive filtering and hierarchical tweet clustering. In tract keywords for each article, and feed these keywords to SNOW-DC@ WWW, pages 33–40, 2014. the Twitter Streaming API, to retrieve relevant tweets. This [2] B. Shi, G. Ifrim, and N. Hurley. Be in the know: Con- way, we obtain a local tweet-bag per article, which we use, necting news articles to relevant twitter conversations. In for topic detection and hashtag recommendation. ECML/PKDD PhD Track, 2014. [3] B. Shi, G. Ifrim, and N. Hurley. Insight4news: Connecting Topic Extraction. Relies on tweet-clustering combined news to relevant social conversations. In ECML/PKDD Demo with a few layers of filtering, aggregation and ranking. Track, 2014. The detailed steps include hierarchical clustering of tweets, 2Evaluation result is in Snow 2014 data challenge: Assessing the per- 1http://insight4news.ucd.ie/insight4news/ formance of news topic detection methods in social media.
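A minimal sketch of the hashtag-recommendation step described above: a Logistic Regression classifier scores article-hashtag pairs from a handful of global/local features, and the scores are used to rank candidate hashtags. The feature values, hashtags and labels below are invented placeholders; the real features and training data are described in [2].

```python
# Sketch of hashtag ranking via Logistic Regression over article-hashtag pair features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [global popularity, local popularity, global relevance, local relevance]
X_train = np.array([[0.9, 0.8, 0.7, 0.9], [0.1, 0.0, 0.2, 0.1],
                    [0.6, 0.7, 0.5, 0.8], [0.8, 0.1, 0.3, 0.2]])
y_train = np.array([1, 0, 1, 0])          # 1 = hashtag relevant to the article

model = LogisticRegression().fit(X_train, y_train)

candidates = {"#budget2014": [0.7, 0.9, 0.6, 0.8], "#cats": [0.9, 0.0, 0.1, 0.0]}
scores = {tag: model.predict_proba([feats])[0, 1] for tag, feats in candidates.items()}
top10 = sorted(scores, key=scores.get, reverse=True)[:10]
print(top10)
```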


Integration of terminology into the CAT environment Mihael Arcan National University of Ireland, Galway [email protected]

Abstract
In this research I focus on the problem of extracting and integrating bilingual terminology into a Computer Aided Translation (CAT) tool scenario. The proposed framework takes as input a small amount of parallel in-domain data, gathers domain-specific bilingual terms and injects them into a statistical machine translation (SMT) system to enhance translation productivity.

1. Motivation
Professional translators deal daily with texts coming from different domains, which require specific knowledge of domain terms. Generic models such as Google Translate or open source SMT systems trained on generic data are the most common solutions, but they often result in unsatisfactory translations of specific vocabulary. Although online resources, e.g. IATE ('Interactive Terminology for Europe'), are a fundamental support for translators, their continuous use can be time demanding. For all these reasons, the automatic integration of bilingual terminology in the SMT system is a crucial step to increase translators' productivity.

2. Related Work
The proposed work is based on monolingual term extraction from a small parallel corpus and the integration of the aligned bilingual terminology into an SMT system. Recent studies on bilingual terminology extraction, e.g. [1], cast the extraction approach as a classification problem. As for the alignment task, [4] investigate the effect of integrating bilingual terminology in the training step of an SMT system, and analyse in particular the performance of the word aligner. Finding the best approach to integrate MWEs in an SMT system is addressed in [3], where the authors use the additional knowledge as additional parallel sentences to train the translation model. The work that is most similar to this approach is the one proposed by [5]. The authors consider a scenario for SMT adaptation starting from a small in-domain parallel corpus, which is then extended by acquiring bilingual comparable corpora from the Web. These are used to extract bilingual terms that are integrated into the SMT system using term-aware phrase tables.

3. Research Question
The current research addresses the integration of bilingual terminologies into the CAT tool scenario. In particular, the approach takes advantage of a small amount of parallel sentences produced by a translator at day i, gathers bilingual terminology and injects it into the SMT system to enhance the translation productivity of other translators at day i+1. This is achieved by investigating the following research questions: a) Can fewer than 200 parallel sentences be enough to extract high quality bilingual terms? b) Which is the best approach to inject domain-specific bilingual terminology into SMT systems in a real CAT environment?

4. Evaluation
The evaluation of bilingual terminology injection into SMT has been performed on IT data. The relevant bilingual terms are extracted from the test set that comes from translation day i and are applied to the test set of day i+1. Therefore two different test sets are used, coming from translator A and translator B. Furthermore, two methods for integrating such terms into SMT were investigated. In the case of the translator A test set (baseline SMT system 15.10 BLEU), the best overall performance, with 16.65 BLEU points, is generated by the XML markup. Comparable to the latter strategy, the cache-based models [2] achieved a BLEU score of 16.61. Focusing on the second translator, B (baseline 14.22), the XML markup achieves a BLEU score of 15.35. With 15.54 BLEU points, the cache-based model produced a slightly better performance.

5. Conclusion
In this paper, an approach to extract domain-specific terms from a small set of documents and align them across different languages is illustrated. Furthermore, it shows the integration of the domain-specific terms into an SMT system and compares the XML markup with a novel cache-based approach. The main evaluation simulates a real CAT scenario, where data from a working day of a translator can be used to improve the translation work of the following day. Comparing the cache-based approach with the baseline, the former outperforms the latter in each setting. The improvement on the IT data is 0.72 and 1.32 BLEU points, respectively. In comparison to the XML markup, the cache-based approach gains only slight improvements.

References
[1] A. Aker, M. Paramita, and R. Gaizauskas. Extracting bilingual terminologies from comparable corpora. In Proceedings of ACL, Sofia, Bulgaria, 2013.
[2] N. Bertoldi, M. Cettolo, and M. Federico. Cache-based online adaptation for machine translation enhanced computer assisted translation. In Proceedings of MT Summit XIV, Nice, France, 2013.
[3] D. Bouamor, N. Semmar, and P. Zweigenbaum. Identifying bilingual multi-word expressions for statistical machine translation. In Proceedings of LREC'12, Istanbul, Turkey, 2012.
[4] T. Okita and A. Way. Statistical machine translation with terminology. In Proceedings of SPIP, Tokyo, Japan, 2010.
[5] M. Pinnis and R. Skadiņš. MT adaptation for under-resourced domains - what works and what not. Frontiers in Artificial Intelligence and Applications. IOS Press, 2012.
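To make the XML markup strategy concrete, the sketch below annotates known bilingual terms directly in a source sentence so that the decoder is encouraged to use the supplied translations. The tag and attribute names follow Moses-style XML input but should be checked against the decoder in use; the term pairs are invented examples.

```python
# Sketch of the XML markup strategy: wrap matched source terms with a tag that carries
# the target-language translation. Exact tag/attribute syntax depends on the SMT system.
import re

def annotate(sentence, term_table):
    # Longer terms first, so multi-word expressions are matched before their sub-parts.
    for src, tgt in sorted(term_table.items(), key=lambda kv: -len(kv[0])):
        pattern = r"\b" + re.escape(src) + r"\b"
        sentence = re.sub(pattern, f'<term translation="{tgt}">{src}</term>', sentence)
    return sentence

terms = {"hard disk": "Festplatte", "driver": "Treiber"}     # toy IT-domain term pairs
print(annotate("Update the driver for the hard disk.", terms))
# -> Update the <term translation="Treiber">driver</term> for the
#    <term translation="Festplatte">hard disk</term>.
```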


A Latent Space Analysis of User Lifecycles in Online Communities Xiangju Qin, Derek Greene, Padraig Cunningham Insight Centre for Data Analytics, UCD [email protected]

Abstract
Collaborations such as Wikipedia are a key part of the value of the modern Internet. At the same time there is concern that these collaborations are threatened by high levels of member turnover. In this research we borrow ideas from topic analysis to map editor activity in Wikipedia over time into a latent space that offers an insight into the evolving patterns of editor behavior. This latent space representation reveals a number of different categories of editor (e.g. content experts, social networkers) and we show that it does provide a signal that predicts an editor's departure from the community. The results also identify features that differentiate long term users and short term users.

1. Introduction
Recent years have witnessed an increasing population of online communities, such as Wikipedia and StackOverflow, that rely on contributions from volunteers to build software and knowledge artifacts. This phenomenon requires a better understanding and characterization of user behavior so that the communities can be better managed, new services delivered, and challenges and opportunities detected. For instance, by understanding the general lifecycles that users go through and the key features that distinguish different user groups and different life stages, we can develop techniques for applications such as i) churn prediction, ii) personalized recommendation, iii) expert finding, and iv) malicious user detection.

2. Related Work
Recent studies have modeled user lifecycles (also termed user profiles) in online communities from different perspectives. Such studies have so far focused on separate or combined user properties, such as information exchange behavior in discussion forums, social and/or lexical dynamics in online platforms [1], and diversity of contribution behavior in Q&A sites [2]. These studies employed either principal component analysis and clustering to identify user profiles or entropy measures to track social and/or linguistic changes throughout user lifecycles. While previous studies provide insights into community composition, user profiles and their dynamics, they have limitations either in their definition of lifecycle periods [1] or in the expressiveness of user lifecycles [2].
There have been significant advances in topic models that develop automatic text analysis techniques to discover latent structures from time-varying document collections [3]. This research presents a latent space analysis of user lifecycles in online communities. We are interested in research questions such as:
i) What are the differences between the lifecycles of long-term and short-term users?
ii) What are the lifecycles of users who become experienced and contribute high quality (vs. low quality and malicious) content?

3. Proposed Solution
This research employs a dynamic topic model approach [3] to analyze how users develop throughout their lifecycles in online communities. In analogy to text analysis, a user's edit activity across different namespaces/topics can be regarded as a 'document', in which the number of edits to that namespace is considered as word frequency. By aggregating the activity of each user on a quarterly basis, we obtained a time-varying user activity dataset for the community. We then apply topic models to analyze the dataset and identify the evolving patterns of user activity.

4. Evaluation
We evaluate the solution on the English Wikipedia dump, and show that the features inspired by our analysis are beneficial for churn prediction, and that long term editors experience a relatively soft evolution in their editor profiles, while short term editors experience a considerably more fluctuating evolution in their editor profiles. Fig. 1 presents the lifecycle of a long term editor.

    Fig. 1: The dynamics of the profile for a long term user.

5. Conclusion
The analysis reveals that long term and short term users generally have very different profiles and evolve differently over their lifespans, and that features inspired by the model are beneficial for churn prediction, which opens interesting questions for future research.

References
[1] C. Danescu-Niculescu-Mizil et al. "No Country for Old Members: User Lifecycle and Linguistic Change in Online Communities". In Proc. of WWW'13, pp. 307–318.
[2] A. Furtado, N. Andrade, N. Oliveira, and F. Brasileiro. "Contributor Profiles, their Dynamics, and their Importance in Five Q&A Sites". In Proc. of CSCW'13, pp. 1237–1252.
[3] D. M. Blei and J. D. Lafferty. "Dynamic Topic Models". In Proc. of ICML'06, pp. 113–120.
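The "edit activity as documents" representation described in Section 3 can be illustrated as follows: each (user, quarter) becomes a row of namespace edit counts, over which a topic model is fitted. A plain LDA is used here purely as a stand-in for the dynamic topic model of [3]; the namespaces and counts are invented.

```python
# Sketch: user-quarter edit counts per namespace treated as term frequencies, then a
# topic model over those "documents" to recover latent editor roles.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

namespaces = ["article", "article_talk", "user_talk", "project"]
# Rows: edit counts per namespace for one user in one quarter (toy numbers).
X = np.array([
    [120, 10,  2,  0],    # content-focused quarter
    [ 15,  5, 80, 30],    # socially-focused quarter
    [ 90, 20,  5,  5],
    [  5,  0, 60, 40],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))    # per-quarter mixture over latent editor "roles"
```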


Learning with subsets of the data Aidan Boland, Nial Friel Insight Centre for Data Analytics E-mail [email protected]

Abstract
Markov chain Monte Carlo (MCMC) is a popular tool for Bayesian inference, but it is too computationally intensive when applied to large datasets or distributions with intractable likelihoods. To overcome this computational burden I use only subsets of the data to make inference on the overall data. For this to be useful, the resulting MCMC chain must 'match' the true distribution in which we are interested.

1. Motivation
There is huge interest in 'big' data, and there are many likelihoods which become intractable as the size of the data increases. Gibbs random fields (GRFs) are an example of such a case; a GRF is a graphical model used in a variety of areas such as image analysis. GRFs suffer from the curse of dimensionality: only trivially small cases can be dealt with using standard techniques.

2. Problem Statement
The likelihood of a GRF takes the form:

    f(y | θ) = q(y | θ) / Z_θ = exp(θ^T s(y)) / Z_θ        (1)

The intractability of this likelihood lies in the normalising constant Z_θ, which is a sum over all possible graphs. To create an MCMC chain this likelihood needs to be evaluated at each step; various techniques are used to overcome this, such as the exchange algorithm of Moller et al. [5] or Approximate Bayesian Computation (ABC) [4].

3. Related Work
The exchange algorithm and ABC both rely on sampling from the distribution of the data, which can be computationally intensive. Another approach is to use an approximation to the likelihood which only uses a subset of the data. Bardenet et al. [2] and Maclaurin [3] use approaches where subsets of the data are used to make inference on the overall data. The crucial part of these methods is choosing which subsets of the data to use.

4. Research Question
Can reasonable inference be made on the parameters of interest using only a subset of the data? How do we choose which subset(s) to use?

5. Hypothesis
Choose a suitable subset of the overall data, and using this subset create a Markov chain which targets the true density of interest.

6. Proposed Solution
If we use subsets of the data we can create an easily computable approximation to the true likelihood. The composite likelihood can be defined as follows,

    f(y | θ) ≈ ∏_i f(y_{A_i} | θ, y_{A_{-i}}) = ∏_i q(y_{A_i} | θ, y_{A_{-i}}) / Z_θ(y_{A_{-i}}),

where A_i is a subset of the data. The normalising constant Z_θ(y_{A_{-i}}) is computationally efficient if we choose subsets which are sufficiently small.
We can use this estimate along with Contrastive Divergence [1] and Stochastic Gradient Langevin Dynamics [6] to create samples from our distribution of interest.

7. Evaluation
7.1. Ising study
A graph of size 1000 × 1000 was simulated. An exchange algorithm was run for 24 hours to get a 'ground truth' of the true distribution. Our algorithm using the composite likelihood and contrastive divergence was then run for subsets of sizes 8, 16 and 32.

                           Mean     SD        Time (minutes)
    Exchange               0.4001   0.00044   1440
    Blocksize  8 × 8       0.4005   0.00292   15
    Blocksize 16 × 16      0.3999   0.00163   45
    Blocksize 32 × 32      0.4001   0.00112   165

    Table 1: Means and standard deviations.

From the table we see that the estimates of the means and standard deviations improve as the subset size increases. However, the standard deviations are not very good estimates of the true standard deviation. Further work is needed to improve this method so that we get better estimates.

References
[1] A. Asuncion, Q. Liu, A. Ihler, and P. Smyth. AISTATS, 2012.
[2] R. Bardenet, A. Doucet, and C. Holmes. Proceedings of the 31st International Conference on Machine Learning, 2014.
[3] D. Maclaurin and R. P. Adams. ArXiv (arxiv.org/abs/1403.5693), 2014.
[4] J.-M. Marin, P. Pudlo, C. P. Robert, and R. J. Ryder. Statistics and Computing, 22(6):1167–1180, 2012.
[5] J. Moller, A. Pettit, R. Reeves, and K. Bertheksen. Biometrika, 93:451–458, 2006.
[6] M. Welling and Y. Teh. Proceedings of the 28th International Conference on Machine Learning, pages 681–688, 2011.
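The sketch below shows the Stochastic Gradient Langevin Dynamics ingredient [6] on a deliberately simple model (the posterior mean of a Gaussian), where each update uses only a small subset of the data. In the GRF setting the minibatch log-likelihood gradient would be replaced by gradients of the composite-likelihood blocks; the toy model, step size and sample sizes here are our own choices, not the settings used in the Ising study.

```python
# Sketch of SGLD on subsets of the data for a toy Gaussian-mean model.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.4, scale=1.0, size=10_000)      # synthetic observations
N, n, eps = len(data), 100, 1e-4
theta, samples = 0.0, []

for t in range(2_000):
    batch = rng.choice(data, size=n, replace=False)
    grad_prior = -theta                                  # N(0, 1) prior on theta
    grad_lik = (N / n) * np.sum(batch - theta)           # rescaled minibatch gradient
    theta += 0.5 * eps * (grad_prior + grad_lik) + rng.normal(0.0, np.sqrt(eps))
    samples.append(theta)

print(np.mean(samples[1_000:]))                          # close to the data mean, ~0.4
```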


Low Cost Autonomous Sensing Platforms for Water Quality Monitoring Deirdre Cogan, John Cleary, Cormac Fay and Dermot Diamond Dublin City University, Glasnevin, Dublin 9, Ireland [email protected]

Abstract Our ability to effectively monitor the aquatic By combining a modified version of the Berthelot environment at remote locations is essential as pressure method with microfluidic technologies and LED based on available water resources continues to grow. There optical detection systems, a low cost monitoring system is therefore a growing need for low cost, remote sensing for the detection of ammonia in water has been systems which can be deployed in situ in sufficiently developed. The assay developed is a variation on the large numbers to ensure that data on key water quality Berthelot method, eliminating several steps previously parameters is readily available. associated with the method for a nontoxic and simple colorimetric assay allowing for the determination of 1. Introduction ammonia up to 12 mg/L ammonia with a limit of Monitoring and protecting the quality of our detection of 1.5 µg /L ammonia. Validation was environmental waters is of major concern today. The achieved by analyzing various water samples by the development of low cost autonomous sensor platforms modified method and ion chromatography resulting in a could provide the basis of a widely dispersed sensor correlation coefficient of 0.9954. The method was then network, providing frequent updates about the implemented into a fully integrated field deployable concentration of specific target species at many platform consisting of a sample inlet, storage units for locations. The challenges facing this ideal of monitoring the Berthelot reagent and standards for self-calibration, include the cost of these platforms and the inability to pumping system, a microfluidic mixing and detection “deploy and forget” due to limited long term stability chip and waste storage. The optical detection system and maintenance requirements. consists of a LED light source with a photodiode detector, which enables an absorbance reading from the complex formed. 2. Microfluidic Technology Microfluidic technology has great potential as a solution to the increasing demand for environmental 4. Conclusion monitoring, by producing autonomous chemical sensing Ultimately, the developed systems provide a base in platforms at a price level that creates a significant terms of monitoring waters for nutrients in situ in a impact on the existing market through minimisation of rapid, simple and inexpensive manner. reagents, standard solutions, and power consumption. The ultimate goal is that the quality of the The development of sensing platforms for ammonia and environment would be monitored in real time and at nitrate in water and wastewater are being investigated. remote locations with systems capable of detecting Our approach is to combine microfluidics with multiple target analytes. Subsequently, the data simplified colorimetric chemical assays; low cost generated could then be provided to all bodies of LED/photodiode-based optical detection systems; and interest be it monitoring agencies or the general public wireless communications. providing the much needed information for policy implications.

3. Chemical Sensing Platforms An analysis system for the direct determination of 5. References [1] D. Cogan, J. Cleary, T. Phelan, E. McNamara, M. Bowkett nitrate in water using chromotropic acid has been 1 and D. Diamond, Anal. Methods, 2013, 5, 4798-4804 developed. The chromotropic acid method has been (DOI:10.1039/C3AY41098F). modified eliminating several steps previously associated with this method to facilitate its implementation into an autonomous platform, resulting in a rapid and simple procedure to measure nitrate. The device incorporates low cost, highly sensitive detection to measure nitrate up to 80 mg/L nitrate with a limit of detection of 0.73 µg/L nitrate. Validation was achieved by analyzing water samples from various sources including groundwater, trade effluent and drinking water by the nitrate analyzer and ion chromatography resulting in an excellent correlation coefficient of 0.9924. Ultimately, this system provides a base in terms of monitoring waters for nitrate levels in situ in a rapid, simple and inexpensive manner.
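Purely as an illustration of how the photodiode absorbance readings and on-board standards described above could be turned into a concentration estimate, the sketch below fits a linear (Beer-Lambert style) calibration curve. All numbers are invented, and the deployed platform's actual calibration procedure is not specified at this level of detail in the abstract.

```python
# Illustrative linear calibration: absorbance readings of known standards are used to
# convert a sample's absorbance into a concentration. Values are invented.
import numpy as np

std_conc = np.array([0.0, 20.0, 40.0, 80.0])          # standard concentrations, mg/L
std_abs  = np.array([0.02, 0.21, 0.40, 0.79])          # measured absorbance of standards

slope, intercept = np.polyfit(std_conc, std_abs, 1)    # Beer-Lambert: A ≈ k·c + b

def concentration(absorbance):
    return (absorbance - intercept) / slope

print(round(concentration(0.30), 1), "mg/L")
```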


Machine Learning in Portfolio Solvers John Horan [email protected] Insight Centre for Data Analytics

Abstract
The aim of this work is to evaluate and apply machine learning algorithms in a portfolio-based approach to combinatorial problem solving.

1. Background
The boolean satisfiability (SAT) problem is made up of boolean variables forming an expression with binary operators. A problem is said to be satisfiable if there exists a valid assignment over these variables that obeys all of the binary operators.
There exist many highly competitive solvers for SAT problems; while these solvers share many of the same underlying features, they have been implemented and optimised in very different ways. This gives rise to a situation where different solvers can handle the same problems with wildly varying results. Wolpert and Macready [2] established that any algorithm that performs well in one class of problems is paying for that extra performance in other classes. The most common approach for SAT solvers is the DPLL algorithm augmented by special techniques such as clause-learning, fast unit-propagation, randomization and restart strategies, among others.
This disparity between algorithms has given rise to a number of portfolio-based solvers, which aim to take advantage of this "no free lunch" scenario by building up a portfolio of solvers and choosing among them for a given problem. SATzilla best exemplifies this approach, taking a problem instance and categorising it against a set of problems it has examined in the past, to predict which solver from its portfolio would be best suited [1].

2. Motivation
Portfolio solvers offset the time taken to choose a solver against the time gained by running the solver best suited to a given problem, allowing portfolio-based solvers to outperform traditional approaches when working on a diverse set of problems. The accuracy of the decision on which solver is best suited is of vital importance to the competitiveness of this type of approach. This work aims to build on SATzilla and investigate whether it is possible to make a better selection of solver by using machine learning algorithms.

3. Proposed Solution
3.1. Predicting run time using machine learning
Firstly, the machine learning algorithms J48, k-nearest neighbours, RandomForest and NaiveBayes, as well as OneR and ZeroR for comparison, are used to predict the magnitude of running time for each solver using ten-fold cross evaluation. This resulted in RandomForest being selected as the best algorithm for the portfolio.

3.2. Ridge regression versus Random Forest
Then a portfolio is built using RandomForest as the algorithm for deciding the best solver. The same magnitude approach is used, where the machine learning algorithm is used to pick a subset of the solvers which it believes belong to the lowest magnitude category for a given problem. From this subset, the solver which has the highest number of successfully solved instances in the case base is selected to run on the given problem.

4. Evaluation
To evaluate this approach, the instances used in the SAT09 (www.cril.univ-artois.fr/SAT09/) and SAT11 (www.cril.univ-artois.fr/SAT11/) competitions are run using SATzilla and the machine learning based approach. The SAT competitions allocate an hour for each instance to run in, before the solver is killed and recorded as a timeout. As the machine learning approach depends on SATzilla to generate the features for a problem, the time taken for this is subtracted from the allowed time for the machine learning approach. Additionally, it is possible for SATzilla to solve an instance using the solver it employs for feature extraction, and in such cases neither portfolio solver is run. Table 1 shows that this approach outperforms SATzilla both on the number of instances it successfully solves and on how much time it takes to do so.

                                  SAT 2009                 SAT 2011
    Dataset                   SATzilla  Portfolio      SATzilla  Portfolio
    handmade
      Instances                     295                      291
      Solved by presolver            56                       45
      Run by default solver           0                        0
      Solved                    115        124           53        90
      Timeout                   124        115          193       156
      Time (sec)             510303     468326       758293    630197
    random
      Instances                     570                      604
      Solved by presolver             7                       35
      Run by default solver           0                        0
      Solved                    154        159          272       355
      Timeout                   409        404          297       214
      Time (sec)             770451     635777      1215683    819200
    industrial
      Instances                     329                      301
      Solved by presolver            39                       44
      Run by default solver          17                       36
      Solved                    208        214           78        84
      Timeout                   121        115          143       137
      Time (sec)             566211     527297       589558    563263

    Table 1: Results from the instances used in SAT competition.

References
[1] E. Nudelman and K. Leyton-Brown. SATzilla: An algorithm portfolio for SAT. Solver description, SAT . . . , pages 1–2, 2004.
[2] D. Wolpert and W. Macready. No free lunch theorems for optimization. Evolutionary Computation, IEEE . . . , pages 1–32, 1997.
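A minimal sketch of the selection step in Section 3.2: one runtime-magnitude classifier per solver, with ties among the predicted-fastest solvers broken by past solved counts from the case base. The solver names, features and magnitude buckets are placeholders; the real system uses SATzilla-generated instance features.

```python
# Sketch: per-solver RandomForest predicts a runtime-magnitude class; among the solvers
# predicted to be in the lowest magnitude, pick the one with the best case-base record.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

solvers = ["minisat", "glucose", "lingeling"]          # invented portfolio members
models = {}
for s in solvers:
    X = np.random.rand(200, 10)                        # stand-in for instance features
    y = np.random.randint(0, 4, size=200)              # runtime magnitude class (0 = fastest)
    models[s] = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

solved_count = {"minisat": 120, "glucose": 150, "lingeling": 140}   # case-base record

def pick_solver(instance_features):
    preds = {s: models[s].predict([instance_features])[0] for s in solvers}
    best_mag = min(preds.values())
    candidates = [s for s, m in preds.items() if m == best_mag]
    return max(candidates, key=solved_count.get)

print(pick_solver(np.random.rand(10)))
```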


Maintenance scheduling through degradation monitoring for building service components Ena Tobin Insight Centre for Data Analytics [email protected] University College Cork, Cork, Ireland

Abstract kernel. A sliding window is utilised on the dataset to al- In recent years, there has been a lot of effort on increas- low adaptation of the kernel parameters in time. These ker- ing user comfort and energy efficiency in buildings. This nel parameters are assessed for their ability to represent the led to the increasing use of innovative building components, degradation level of a component. The Particle Filter (PF) such as thermal solar panels, heat pumps, etc.. They equip- is represented by four necessary steps: Initialisation of the ment have potential to provide better performance, energy variables; Propagation of the particles, weighted probabil- savings and increased user comfort. However, as their com- ities calculated and normalised; if necessary, resampling; plexity increases, the requirement for maintenance manage- and, posterior value, yˆ, for y is calculated. The degrada- ment increases. The standard routine for building main- tion metric is define as the particles, which represent the tenance is inspection, which results in repairs or replace- parameters of the state space equation chosen for the PF ments when a fault is found. This leads to unnecessary implementation. inspections which have a cost with respect to downtime of a component and work hours. This research addresses 3. Evaluation of Degradation based Mainte- the requirement for building maintenance performed at the point in time when the component is degrading and requires nance maintenance. The aim is also to reducing the frequency of In order to evaluate the proposed methodology, differ- unnecessary inspections. ent implementation procedures are presented. They are de- pendant on the maturity level of information gathering and storage within the facility management organisation. They 1. Introduction address situations from no failure or operational data be- There are many different techniques for scheduling ing available and to many instances of failure available. maintenance activities, such as reactive, scheduled, or The implementations were assessed based on the lead or condition-based maintenance. These make a trade-off be- lag times of the resulting degradation flags compared to the tween equipment health, cost and user-comfort. There are known degradation points. It concludes from these evalu- more comprehensive forms of maintenance employed in the ations that degradation based maintenance scheduling does process and manufacturing industry, in which there are high provide maintenance scheduling more orientated to occur- costs associated with equipment down-time. These strin- rences of critical degradation level compared to scheduled gent controls are not employed widely for building service and reactive maintenance and that Degradation based Main- components due to the relationship between cost and criti- tenance (DbM) scheduling does, for a number of implemen- cality of operation. However, a large amount of data, which tations detailed within this work, recognise degradation be- can be used to calculate degradation metrics, is available fore failure occurs for the applied case studies. The specific in relation to building service components at present. This results found from this analysis show that its effectiveness is due to the increased use of Building Management Sys- for the three case studies is dependant on the state space tems and utilisation of both wired and wireless sensors and equation utilised in the case of the PF and the input-output meters. 
The premise of this research is that by utilising sta- relationship utilised with respect to the GP. tistical techniques to extract, track and predict the degrada- tion level of building service components, feasible method- 4. Conclusion ologies for invoking maintenance, before failure occurs, are The norm is that scheduled and reactive maintenance is produced. acceptable for Building Service Components (BSCs), this research shows that DbM can identify reactive maintenance 2. Statistical Techniques before reactive maintenance and that it can also provide A Gaussian Process (GP) as used here, for regres- more exact scheduling compared to scheduled maintenance. sion purposes, is a function that links a set of inputs, x1, x2, ..., xn to an output. That is, y = f(x1, x2, ..., xn) + , where  is an error term. For the GPs used in this re- search, the covariance function is derived from the Matern´
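The sliding-window Gaussian Process step described above can be sketched as follows: a GP with a Matérn kernel is refitted on each window and the fitted kernel parameters are tracked over time as a degradation indicator. The simulated signal, window length, noise level and use of scikit-learn are illustrative assumptions, not the exact implementation evaluated in this work.

```python
# Sketch: refit a Matern-kernel GP on sliding windows and track its fitted kernel
# parameters over time as a degradation metric. All data here is simulated.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(1)
t = np.arange(400, dtype=float)
signal = np.sin(t / 20) + 0.002 * t + rng.normal(0, 0.1, t.size)   # slowly drifting sensor signal

window = 100
length_scales = []
for start in range(0, t.size - window, window):
    x = t[start:start + window, None]
    y = signal[start:start + window]
    gp = GaussianProcessRegressor(kernel=Matern(nu=1.5), alpha=1e-2, normalize_y=True).fit(x, y)
    length_scales.append(gp.kernel_.length_scale)

print(length_scales)    # a drift in these parameters would be flagged as degradation
```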


Making Value out of Lifelogs Gunjan Kumar, Houssem Jerbi, and Michael P. O’Mahony Insight Centre for Data Analytics fi[email protected]

1. Introduction
Lifelogging is an emerging area of research which can help in this regard by maintaining a detailed and continuous record of our lives. As defined by Dodge and Kitchin (2007) [1], 'Lifelogging is conceived as a form of pervasive computing consisting of a unified, digital record of the totality of an individual's experiences, captured multi-modally through digital sensors and stored permanently as a personal multi-media archive.' The lifelog captures a detailed trace of one's life, and the always-on and pervasive nature of lifelogging leads to a high rate of data generation. This data is associated with immensely rich contextual information, and the vast quantity of such data presents a number of research challenges; for example, efficient storage, retrieval, annotation, visualisation, summarisation, etc.
Gurrin et al. [3] have investigated the challenges of maintaining and retrieving relevant content from large lifelog databases. Doherty et al. [2] investigated methods to automatically segment the lifelogs into meaningful units or activities and proposed an optimal combination of data sources for activity segmentation. Our work attempts to take this process further and use the lifelogs thus generated for activity summarisation and activity recommendation.

2. Research Question
Our aim is to find ways in which lifelogs can be used to develop personalised applications. We are designing a framework to facilitate such applications.

3. Current Work
We have developed a framework for summarisation and generation of personalised recommendations. We use a multidimensional data model to capture the data along with its structure.
Summarisation: Each occurrence of an activity is represented by an activity object with associated dimensions like date, time, location, user, activity, etc. A continuous chronological sequence of such activity objects constitutes a timeline. We have proposed summarisation algorithms that aggregate the lifelog data over different dimension(s).
Recommendation: We have developed a personalised content-based recommender system which uses lifelogs to recommend the next activities to users. Moreover, it can make recommendations of the activity context, for example, recommend an activity along with the start time and the location where it should be performed.
An important component of this recommender system is finding the similarity between two sequences of activity objects. To this end, we have proposed a multi-granular edit distance which takes into account the sequence of activity occurrences as well as the variations at the individual level of the activity occurrence features.
For a given recommendation point, the current timeline is generated. Then, candidate timelines are selected from the user lifelogs, i.e., timelines containing the current activity and with edit distances below a threshold. The activities from the most similar candidate timelines are selected to produce a ranked list of recommended activities based on an adapted ranking function. We conducted an offline evaluation of our approach using the training/test paradigm. The evaluation showed a good overall performance of our approach (see Figure 1). Recall values up to 0.58 (user 3) and 0.95 (user 2) were seen for top-5 recommendations. It is worth noticing that our approach considerably outperformed the baseline recommender. Figure 2 shows the effect of distance thresholding, which is significant in a few cases. We have demonstrated that good quality recommendations can be made in the context of activity recommendation.

    Figure 1: Recall per user.    Figure 2: Recall vs. thresholding.

4. Future Work
As part of our future work we will use a larger dataset to further validate our results. We also plan to develop a collaborative approach for activity recommendation.

References
[1] M. Dodge and R. Kitchin. "Outlines of a world coming into existence": Pervasive computing and the ethics of forgetting. Environment and Planning B, 34(3):431–445, 2007.
[2] A. R. Doherty, A. F. Smeaton, K. Lee, and D. P. W. Ellis. Multimodal segmentation of lifelog data. In Proc. RIAO 2007, Pittsburgh, 2007.
[3] C. Gurrin, D. Byrne, G. J. F. Jones, and A. F. Smeaton. Architecture and challenges of maintaining a large-scale, context-aware human digital memory.
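The recommendation step described above can be sketched with a plain edit distance standing in for the proposed multi-granular one: candidate timelines containing the current activity and within a distance threshold vote for the activity that followed. The activities, threshold and frequency-based ranking below are illustrative simplifications of the adapted ranking function.

```python
# Sketch: candidate timelines within an edit-distance threshold vote for the next activity.
from collections import Counter

def edit_distance(a, b):
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + (a[i-1] != b[j-1]))
    return d[len(a)][len(b)]

def recommend(current, lifelog_timelines, threshold=2, top_k=3):
    votes = Counter()
    for past in lifelog_timelines:
        if current[-1] in past and edit_distance(current, past) <= threshold:
            idx = past.index(current[-1])
            votes.update(past[idx + 1:idx + 2])          # the activity that followed
    return [act for act, _ in votes.most_common(top_k)]

log = [["wake", "run", "shower", "breakfast"], ["wake", "shower", "commute"]]
print(recommend(["wake", "run", "shower"], log))
```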


Managing The Trade-Off Between Response Time And Quality In Hybrid SPARQL Query Processing Based On Response Requirements Soheila Dehghanzadeh Insight Centre for Data Analytics [email protected]

Abstract sions. We define freshness of a query response as the per- Time and quality are two inversionally proportional met- centage of valid responses in the result set. To motivate this rics for the query response. Data warehousing and live problem, consider a user who is willing to broadcast a com- query processing optimize one or the other exclusively. A mercial advertisement and is satisfied with 80% up-to-date hybrid SPARQL query processor combines them by split- email addresses in the response set. The incentive of being ting the SPARQL query between live and local processors satisfied with less freshness is to get faster response time to provide a middle ground between above extremes. Differ- and compute less computational resources. The problem of ent splitting strategies change the trade-off among response adaptive query splitting boils down to two sub-problems. quality and time. Thus, considering the quality requirement First, it requires estimating the freshness of the response of the response while splitting the query will eliminate un- provided with current materialized data. Second, if the cur- necessary live executions and release resources for other rent materialized data could not fulfil user freshness require- queries. In order to fire live execution only on-demand, we ments, then the hybrid engine needs to redirect the most need to estimate the quality of the response provided with selective sub-queries, which can boost the response fresh- the local store and try to boost it, with the least amount of ness up to the level specified by the user, to the live engine. live execution, up to the required quality. Here we explain The second sub-problem is an optimization problem which the potential solutions for the quality estimation problems needs to minimize the live execution overhead while maxi- and compare their performance. mizing the estimated quality of the response.

1. Introduction 3. Proposed Solution and discussion There is a huge amount of RDF data on the Linked Open The first sub-problem has already been addressed in the Data Cloud (LOD) and many companies would like to con- context of relational data model [1]. However, it is not di- tribute to, for facilitating access to their own data. However, rectly applicable to the RDF data model due to the assump- processing queries over all published data is challenging. tion of having an identity key per tuple. Our main hypothe- Common approaches to integrate Linked Datasets are either sis is that statistics of cardinality estimation techniques can providing an integrated view by materializing all data at a be extended to estimate the quality profile of a query re- local repository or executing queries directly on the Linked sponse as well as its cardinality. We used the predicate Data could [2]. The first approach provides very fast re- freshness multiplication as our baseline which assumes to- sponses but of low-quality because changes of original data tal independence among query predicates and assumes data are not immediately reflected on materialized data. The sec- is uniformly distributed per dimension entry. To address the ond approach provides accurate responses but with long re- uniform distribution assumption, we build a histogram per sponse times. Thus, there is trade-off among response time dimension entry. The more granularity in histogram buck- and quality. Recently a hybrid approach for the Linked Data ets leads to more accuracy in histogram estimations. In the information integration has been proposed [3] to address future we are aiming to address the predicate independence this trade-off. However, it splits the query between live and assumption using the probabilistic graphical models[4]. local processors based on a pre-specified coherence thresh- old and does not take into account query specific quality References requirements. [1] D. Dey and S. Kumar. Data quality of query results with gen- eralized selection conditions. Operations Research, 61(1):17– 31, 2013. 2. Problem Statement [2] G. Ladwig and T. Tran. Linked data query processing strate- Sometimes quality requirements of the response can be gies. In The Semantic Web–ISWC 2010, pages 453–469. fulfilled using the materialized local store without live exe- Springer, 2010. cution. Thus, computational resources can be released for [3] J. Umbrich, M. Karnstedt, A. Hogan, and J. X. Parreira. Hy- brid sparql queries: fresh vs. fast results. In The Semantic other queries which leads to better scalability efficient load Web–ISWC 2012, pages 608–624. Springer, 2012. balancing. The critical decision of how to split the query to [4] A. Wagner, V. Bicer, and T. D. Tran. Selectivity estimation fulfil the required response quality has not been investigated for hybrid queries over text-rich data graphs. In Proceedings thoroughly and is the main focus of my work. Here, we are of the 16th International Conference on Extending Database investigating freshness as one of the main quality dimen- Technology, pages 383–394. ACM, 2013.
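A minimal sketch of the freshness-estimation baseline described above, namely multiplying per-predicate freshness under the independence assumption, together with an optional per-bucket histogram refinement. The function names, the histogram layout and the default values are illustrative assumptions rather than the actual implementation.

```python
def baseline_freshness(query_predicates, predicate_freshness):
    """Estimate response freshness as the product of per-predicate freshness,
    assuming total independence among query predicates."""
    est = 1.0
    for p in query_predicates:
        est *= predicate_freshness.get(p, 1.0)
    return est

def histogram_freshness(query_predicates, histograms, bindings):
    """Refine the estimate with a histogram per dimension entry: look up the
    bucket matching the queried value instead of assuming a uniform distribution."""
    est = 1.0
    for p in query_predicates:
        buckets = histograms.get(p)
        value = bindings.get(p)
        if buckets and value is not None:
            for (lo, hi), freshness in buckets:
                if lo <= value < hi:
                    est *= freshness
                    break
    return est

# If the estimate falls below the user's requirement (e.g. 0.8), the most
# selective sub-queries would be redirected to the live engine.
print(baseline_freshness(["foaf:mbox", "foaf:age"], {"foaf:mbox": 0.7, "foaf:age": 0.95}))
```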


Mining Opinions from User-Generated Reviews for Recommender Systems Khalil Muhammad, Aonghus Lawlor, Rachael Rafter, Barry Smyth Insight Centre for Data Analytics [email protected]

Abstract summarised towards the generation of explanations for rec- This work addresses the problem of building recommender ommendation? How can explanations be used as a supple- systems by analysing content from user-generated reviews. mentary source of knowledge for recommendation. We explore techniques for mining opinions from hotel re- We hypothesise that effectiveness of opinion-based rec- views. We propose a strategy to establish the quality of ex- ommender systems will improve with quality features. tracted features, and we study the influence of user opinions Likewise, supplementary recommendation knowledge from on the presentation and explanation of recommendation. explanations will increase the user-acceptance of the sys- tem. 1 Problem Statement The goal of this work is to continue to explore the poten- 4 Proposed Solution and Discussion tial of user-generated reviews as a source of recommenda- The starting point of our work is to identify quality fea- tion knowledge. tures. We use the approach in [1] to automatically mine First we focus on harnessing opinionated product de- opinions from a TripAdvisor dataset of 167430 textual re- scriptions in recommendation. This involves developing views for 2370 hotels. techniques for evaluating the quality of automatically mined Our experiment to measure the quality of features uses features by extending the work of Dong et al. [1]. Accord- various lexical and frequency-based filters to remove unin- ingly, we study the influence of these features in different teresting and unpopular features. The resulting features are types of recommendation and explanation tasks. mapped into a popularity grid based on their occurrence in Secondly, we investigate the feasibility of explanation- reviews and hotels. Each quadrant of the popularity grid driven recommendation. The aim is to develop an approach represents a set of features that share similar characteristics for exploiting knowledge from explanations to supplement based on their pattern of occurrence. We experiment with opinionated features during recommendation. clustering and classification techniques to find the optimal Without the loss of generality, we concentrate this re- threshold for dividing the popularity grid into feature quad- search on textual user-generated hotel reviews for travel rants. websites; hence the products to be recommended are hotels. Although the evaluation of results for this experiment is ongoing, we expect that it provide a basis for the identifica- tion of quality features. 2 Related Work Our next step is to focus on other research questions. Moghaddam et al. [3] proposed methods of extracting Consequently, we will experiment with different versions useful information from user reviews, and [1] shows how of the mined opinions and quality features to build recom- these techniques can be usefully adapted to produce the type mender systems and explanation interfaces. of product descriptions that can drive a case-based recom- mender. Tintarev et al. [4] define the possible goals of ex- planation facilities in recommender systems and show that References [1] R. Dong, M. Schaal, M. P. O’Mahony, K. McCarthy, and the presentation of recommendations influences the effec- B. Smyth. Opinionated product recommendation. In Case- tiveness of explanations; and Friedrich et al. [2] propose Based Reasoning Research and Development, volume 7969 a taxonomy for categorising explanations based on major of Lecture Notes in Computer Science, pages 44–58. Springer design principles. 
Berlin Heidelberg, 2013. To our knowledge, there is still no consensus on what [2] G. Friedrich and M. Zanker. A taxonomy for generating ex- constitutes a good explanation, and explanations have not planations in recommender systems. AI Magazine, 32(3):90– been used as a source of recommendation knowledge. 98, 2011. [3] S. Moghaddam and M. Ester. Opinion digger: An unsuper- vised opinion miner from unstructured product reviews. In 3 Research Question and Hypothesis Proceedings of the 19th ACM International Conference on The research questions associated with this work are: Information and Knowledge Management, CIKM ’10, pages How can we measure and evaluate the quality of automati- 1825–1828, New York, NY, USA, 2010. ACM. cally mined features? In what ways can mined opinions be [4] N. Tintarev and J. Masthoff. Designing and evaluating expla- used in different types of recommendation tasks? How does nations for recommender systems. In Recommender Systems Handbook, pages 479–510. Springer US, 2011. opinion information influence the generation, presentation and explanation of recommendation? How can reviews be


Multi-modal Continuous Human Affect Recognition

Haolin Wei,∗ David S. Monaghan,∗ Noel E. O’Connor∗ Insight Centre for Data Analytics [email protected], [email protected], [email protected]

Abstract plexity of expressive behaviours found in real-world set- Automatic human affect analysis has attracted increasing tings and the use of additional depth modality could im- attention from the research community in recent years. This prove the recognition result by providing more robust facial extended abstract gives an overview of the current affect landmark detection and additional appearance features. recognition work carried out at Dublin City University. 6. Proposed Solution 1. Motivation Understanding human affective state is indispensable for human-human interaction and social contact. As computer systems have become part of our daily lives, machines that can understand human emotions could be potentially use- ful for Human Computer Interaction (HCI), customer ser- vices, call centers, E-learning, intelligent autonomous vehi- cles, games and entertainment.

2. Problem Statement Figure 1: Sample screenshot from captured affect dataset The majority of datasets and studies reported in the In this research, we propose a robust affect analysis and Affect Computing literature have used unrealistic scripted detection system for real-world applications. A state of art acted emotions. Such acted or induced data cannot model multi-modal affect dataset is first collected (See Figure 1). natural interaction sufficiently as expressions produced are The datset consists of video, audio and depth signals. The too pronounced and are seldom encountered in more real- spontaneous affective states are elicited using a three way istic data. In addition, almost all datasets are captured us- debate scenario in an unconstrained environment with var- ing controlled environments with fixed blank backgrounds ious lighting conditions and backgrounds. Secondly, the and constant lighting. Systems trained on such data are very geometric and appearance facial features along with audio likely to fail if used to understand the subtle complex nature features and additional depth features extracted from visual of spontaneous interactions in real-world applications. and vocal cues will be used to train an regression model to continuously predict the affect value. 3. Related Work Given the practical and theoretical importance of the af- 7. Evaluation fective computing field, lots of research have been con- The evaluation will be performed on each modality sep- ducted towards automatic emotion and affect recognition. arately and on the fusion of video and audio modalities as Various multi-modal spontaneous datasets have been cre- well as video, audio and depth modalities. The evaluation ated [2]. However there still lack of continuous annotated will quantify the system error based on a manually anno- spontaneous dataset consists video, audio and depth signals. tated ground-truth. Early research focused on recognising basic emotions from visual and vocal cues [1] while recent trends have shifted References towards recognising continuous and dimensional affective [1] H. Gunes and M. Piccardi. Bi-modal emotion recognition state [3]. from expressive face and body gestures. Journal of Network and Computer Applications, 30(4):1334 – 1345, 2007. Spe- cial issue on Information technology. 4. Research Question [2] H. Gunes and B. Schuller. Categorical and dimensional affect Could the depth modality increase the accuracy and ro- analysis in continuous input: Current trends and future direc- bustness of the affective recognition system? tions. Image and Vision Computing, 31(2):120–136, 2013. [3] S. Petridis and M. Pantic. Audiovisual discrimination be- tween laughter and speech. In IEEE International Conference 5. Hypothesis on Acoustics, Speech and Signal Processing, 2008. ICASSP This research will demonstrate that, the dataset captured 2008., pages 5117–5120, March 2008. in an unconstrained environment could generalise the com-
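A hedged sketch of the regression step described in the proposed solution: per-frame facial, audio and depth features are concatenated and a regressor is fitted to predict a continuous affect value. The feature dimensions, the choice of support vector regression and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Per-frame features from each modality (dimensions are illustrative).
facial = rng.normal(size=(500, 20))    # geometric + appearance features
audio = rng.normal(size=(500, 12))     # e.g. prosodic features
depth = rng.normal(size=(500, 8))      # additional depth-derived features
X = np.hstack([facial, audio, depth])  # early (feature-level) fusion
y = rng.uniform(-1, 1, size=500)       # continuous affect annotation, e.g. valence

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
model.fit(X[:400], y[:400])
print("predicted affect values:", model.predict(X[400:405]))
```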

∗Acknowledgement EU FP7-ICT-281123 REVERIE


Negative FaceBlurring: A Privacy-by-Design Approach to Visual Lifelogging with Google Glass TengQi Ye, Brian Moynagh, Rami Albatal and Cathal Gurrin Insight Centre for Data Analytics, Dublin City University, Ireland [email protected]

Abstract 5. Hypothesis Wearable devices such as Google Glass are receiving in- We believe that the visibility of an individuals face in creasing attention and look set to become part of our tech- an image is the main factor in preserving or violating that nical landscape over the next few years. At the same time, individuals privacy. For this reason, we are of the opinion lifelogging is a topic that is growing in popularity with a that effective face detection and accurate face recognition host of new devices on the market that visually capture life are fundamental to any privacy by design based lifelogging experience in an automated manner. We describe a visual system. lifelogging solution for Google Glass that is designed to capture life experience in rich visual detail, yet maintain 6. Proposed Solution the privacy of unknown bystanders. We propose employing user privacy policies for regulat- ing dynamic views over lifelog data. In this way, a user 1. Motivation (by nominating friends) is free to choose in which lifelogs Using a wearable technology such as Google Glass to their image can appear. These policies can be updated in capture a visual lifelog often results in capturing images real-time so that an individual can retrospectively add or of unknown bystanders from whom the lifelogger may not remove access rights to their identifiable image. We call have permission. Naturally this creates a tension between this approach real-time policy-driven negative face blur- data gathering and society’s concerns about an individuals ring. Haar-like Feature-based Cascade Classifiers [2] are right to privacy. Our goal is to facilitate the collection of vi- used for face detection. Eigenfaces, Fisherfaces and Local sual lifelogging data whilst maintaining the privacy of any Binary Patterns are used for facial recognition. bystanders that are unknown to the lifelogger. 7. Evaluation 2. Problem Statement A total of 8,657 lifelog images were used to evaluate the In order to allow an individual to choose the lifelogs in system. Facial detection was evaluated by counting images which their image will appear and those in which it should containing faces, in which all faces were detected and im- not, the individual needs to be able to define a privacy pol- ages that contained no faces that were correctly identified icy. Successful implementation of such a policy is depen- as such, as a pass. A pass rate of 80.68% was observed. dent upon the reliable detection of faces and accurate recog- For facial recognition, 1,300 pictures containing faces nition of detected faces within images. were randomly selected from dataset. From these images 1,310 faces were detected. The false positive rate (i.e by- 3. Related Work standers classified as friends) was 0.76% and the false neg- ative rate (i.e. friends classified as bystanders) was 29.01%. Many of the early uses of lifelogging have focused on 67.18% of the faces were correctly identified as bystanders. deploying wearable cameras to provide memory assistance or as a source of data for long-term user studies. How- ever, none of the previously developed lifelogging proto- 8. Acknowledgements types take a privacy-by-design approach. Privacy by design This publication has emanated from research supported principles are based on seven foundations [1]. 
From these in part by research grants from Irish Research Coun- seven principles, we choose to make privacy the proactive cil (IRCSET) under Grant Number GOIPG/2013/330 and default configuration, inherent in the design of the software, Science Foundation Ireland (SFI) under Grant Number that separates the lifelogger from the data and which re- SFI/12/RC/2289. spects the privacy of unwilling subjects and bystanders. References 4. Research Question [1] A. Cavoukian. Privacy by design. Report of the Information & Privacy Commissioner Ontario, Canada, 2012. What constitutes a violation of a bystanders privacy [2] P. Viola and M. Jones. Rapid object detection using a boosted within a visual lifelog? What measures can be taken to pre- cascade of simple features. In Computer Vision and Pat- serve a bystanders’ privacy? Can a satisfactory balance be tern Recognition, 2001. CVPR 2001. Proceedings of the 2001 reached between the goals of the lifelogger and the privacy IEEE Computer Society Conference on, volume 1, pages I– rights of unknown bystanders? 511. IEEE, 2001.
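The detection-then-blur pipeline described above can be sketched with OpenCV's Haar cascade detector. This is an illustrative fragment, not the authors' system: the cascade file path, the blur kernel size and the whitelist check are assumptions, and the face-recognition step (Eigenfaces, Fisherfaces or Local Binary Patterns) is stubbed out.

```python
import cv2

def is_friend(face_img):
    # Stub: in the real system a trained recogniser would decide whether
    # this face belongs to a friend nominated in the user's privacy policy.
    return False

def negative_face_blur(image_path, cascade_path="haarcascade_frontalface_default.xml"):
    img = cv2.imread(image_path)
    detector = cv2.CascadeClassifier(cascade_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        if not is_friend(img[y:y + h, x:x + w]):
            # Blur bystanders' faces; friends' faces are left visible.
            img[y:y + h, x:x + w] = cv2.GaussianBlur(img[y:y + h, x:x + w], (51, 51), 0)
    return img

cv2.imwrite("blurred.jpg", negative_face_blur("lifelog_frame.jpg"))
```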


Non Invasive Detection of Biological Fluids

Giusy Matzeu1, Conor O’Quigley2, Eoghan Mc Namara2, Cormac Fay2 and Dermot Diamond2 Dublin City University [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract microfluidic chip that was mounted on top of it (Figure The chemical composition of body fluids contains 2). crucial information about the state of health of an individual. While many efforts have been already directed toward real time analysis of blood and urine, there is still a pressing need for new solutions to non- invasively monitor other fluids like sweat. We report on the preparation of disposable potentiometric sensor strips for monitoring sodium in sweat. We also present their integration in a microfluidic chip used to harvest Figure 2. Expanded view of the different layers used to sweat in-situ during exercise. The sensor-chip is realize the microfluidic chip. integrated with a miniaturized electronic platform able to transmit data wirelessly in real time during a This configuration allows sweat to be collected through stationary cycling session in a controlled environment. a Macro-Duct directly connected to the microfluidics, as shown in Figure 3. 1. Introduction One of the key technological challenges in sensor design is providing low-cost, minimally invasive devices for in situ and real-time monitoring of chemicals produced by the human body1. The ease of access to body fluids such as saliva2 and sweat makes interesting applications of wearable sensors suitable for Figure 3. Macro-Duct system used to harvest sweat use in health care and sport science3. Combination- Real time tests were carried out with the microfluidic electrodes are prepared by screen printing in order to + reduce costs4. An appropriate solid contact material is chip, positioned on top of the Na selective interposed between the carbon layer and the drop-cast potentiometric strip, which was connected to a outer membranes of the ion-selective and reference miniaturised wireless electronic platform protected by a electrodes. The selective response of the ion-selective 3D-printed encasing. membrane is due to the presence of an ionophore, while the reference membrane is insensitive to changes in the 3. Conclusions and Future Work sample composition (Figure 1). By measuring the After validation of the potentiometric system that will potential bias between the two electrodes, the be carried out in the next future, we are planning to concentration of the primary ion in solution can be further reduce its dimensions. Current design efforts in inferred. our group are targeted towards the realisation of a wearable watch-like device.

8. References [1] D. Diamond, “Internet-Scale Sensing”, Analytical Chemistry, ACS, 2004, 278A-286A. [2] C. Zuliani, G. Matzeu and D. Diamond, “A potentiometric Figure 1. Configuration of the screen printed electrodes disposable sensor strip for measuring pH in saliva”, tuned as Ion Selective or Reference Electrodes. Electrochimica Acta, Elsevier, 2014, 292-296. [3] C. Zuliani and D. Diamond, “Opportunities and 2. Results Challenges of using ion-selective electrodes in environmental monitoring and wearable sensors”, Electrochimica Acta, The Na-ISEs were first calibrated vs a standard double Elsevier, 2012, 29-34. liquid junction Ag/AgCl reference electrode (in the -5 -1 + [4] O. D. Renedo, M. A. Alonso-Lomillo, M. J. Martinez, range of interest of 10 -10 M Na ). They were then “Recent Developments in the field of screen-printed monitored using a miniaturised solid contact reference electrodes and their related applications”, Talanta, Elsevier, electrode realised on the screen printed substrate. The 2007, 202-219. potentiometetric strip was finally integrated on a


Object segmentation in images using EEG signals Eva Mohedano, Graham Healy, Kevin McGuinness, Xavier Giró-i-Nieto, Noel E. O’Connor, and Alan F. Smeaton Insight-Centre for Data Analytics (DCU) and Image Processing Group (UPC) {eva.mohedano, graham.healy, kevin.mcguinness, noel.oconnor, alan.smeaton}@insight- centre.org, [email protected] ! ! In this paper, we propose a system that aims at requiring attentional orientational to stimuli, such as segmenting an object within a photo by using a BCI searching for a target in a stream of images. (Brain-Computer Interface) as the interaction method. After filtering of the EEG signals, it is thus possible This work constitutes a proof of concept around the to train a Support Vector Machine (SVM) to feasibility of precise object detection by analyzing and successfully classify the signals with an accuracy of classifying the brain reaction to visual stimulus. The 70%. It is then possible to generate a probability map interaction is thus completely hands-free, requiring only (EEGmap) for the location of the object, where the a user to watch the screen. values assigned to each block in the probability map ! correspond to the SVM scores. 1. Motivation The probability maps obtained in this way represent The steady decrease in the cost of EEG a first approximation of the location of the object, but (Electroencephalography) systems in recent years they are not sufficiently accurate for image makes this sensing modality more accessible beyond segmentation. For this reason, the information obtained the traditional disciplines that typically avail this from the probability maps is used to seed the Grabcut technology. As a result, there has been clear interest in segmentation algorithm to perform the final object the research community to investigate the potential segmentation. This step is performed by firstly applying usefulness of EEG in a range of BCI-application a low pass Gaussian filter on the probability maps that scenarios. To date, much of the multimedia-related blurs the windowing effect, and secondly binarizing the work has focused on using EEG for large-scale image maps (Figure 1). retrieval [1], where EEG signals can be used to identify ! a subject’s detection of a target in a stream of images. ! Given the unique visual recognition capabilities of the ! human brain, detecting such signals offer an avenue to utilise human intelligence to perform tasks a computer cannot such as recognising the significance of an image or part of an image. In this paper we explore how this approach when combined with a suitable image Figure 1 All the steps from the EEGmap extracted from the presentation paradigm can be used to extract brain signals to the final segmentation. In this figure, the information about the content of an image itself with EEGmap is the average of 5 different EEGmap’s users. the aim of segmenting it. By identifying brain activity ! that correlates with an image or part of an image being 3. Results recognised by the user as significant, we can use this The quality of the final segmentation is assessed by information to drive a variety of applications requiring computing the Jaccard Similarity index between the object segmentation and identification. final object mask and the ground truth mask for that ! image. The index is computed as the intersection of the 2. Method mask, divided by the union of both masks. The value of the index varies from 0 to 1, where 1 represents the The approach adopted is based on the Rapid Serial ideal segmentation. 
The system tested in 5 users, Visual Presentation (RSVP) of the different blocks that provided a final averaged Jaccard of 0.47, meanwhile, compose an image containing an object of interest. This when the EEGmaps where averaged across users, the involves dividing an image into different parts and Jaccard increased up to 0.76. displaying each of these in fast succession on a Our system shows that it is possible roughly locate computer screen. The image is assumed to contain a and delineate an object in an image using EEG data and relatively small percentage of target stimuli (typically it becomes a proof if concept that opens the door to new 15%) defined as the blocks that contain part of the interactive modes. object of interest. First results shown that brain activity associated with ! the presentation of parts of the target object are 8. References detectably different from those associated with regions [1] N. Bigdely-Shamlo, A. Vankov, P.R. Ramirez, and S. representing the background (distractor stimulus). This Makeig, “Brain activity based image classification from rapid effect is a result of a well-known class of EEG signals serial visual presentation”, Neural Systems and Rehabilitation Engineering, IEEE Transaction on, vol. 16 no. 5, pp. 432-441, known as the P300 which can be elicited in tasks 2008
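A rough sketch of the post-processing chain described above: smooth the block-level SVM score map, binarise it, use it to seed the GrabCut algorithm, and score the result with the Jaccard index. Threshold values, kernel sizes and array shapes are illustrative assumptions.

```python
import cv2
import numpy as np

def segment_from_eeg_map(image, eeg_map, iterations=5):
    """image: 8-bit BGR image; eeg_map: per-pixel probabilities in [0, 1]
    built from the block-level SVM scores."""
    smooth = cv2.GaussianBlur(eeg_map.astype(np.float32), (31, 31), 0)
    mask = np.where(smooth > smooth.mean(),
                    cv2.GC_PR_FGD, cv2.GC_PR_BGD).astype(np.uint8)
    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    cv2.grabCut(image, mask, None, bgd, fgd, iterations, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)

def jaccard(pred_mask, gt_mask):
    """Intersection over union of two binary masks, as used in the evaluation."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union else 1.0
```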


Occupant Location Prediction Using Association Rules Conor Ryan University College Cork [email protected]

1. Motivation 5. Hypothesis Heating, ventilation, air conditioning (HVAC) Our hypothesis is that a set of association rules systems are significant consumers of energy. In order to which accurately predicts future occupant locations can operate them with maximum efficiency, building be generated by mining the historical location data of management systems must take into account occupant those occupants. While association rule mining is movements. However, due to the response time of designed to find correlations in simple sets of items, we HVAC systems, this requires advance knowledge of the can modify the approach to take account of the specific occupant locations. attributes of occupancy data to reach higher accuracy. Furthermore the approach can be extended to learn 2. Problem Statement more general concepts and sequences in the data to Given a dataset of historical occupant movements, extend the range of patterns which can be found. we require a system which can use this data to accurately predict future occupant movements. We 6. Proposed Solution require both bulk occupancy levels for zones, and We organize the historical occupant location data individual occupancy of, for example, single-person into a set of instances where each instance is a single offices. We require both predictions into the near future, day for a single occupant. We divide each day into a set e.g. location an hour from now, and further into the of time slots. Each instance therefore contains the future, e.g. location at given times tomorrow. To this location of the occupant during each timeslot, the end, we wish to find any patterns in the historical data occupant’s scheduled location during each timeslot for which could be used to make predictions. which timetable data is available, and metadata about the location data such as the occupant’s identity and the 3. Related Work day of the week. Existing work on predicting occupant locations uses We make several modifications to standard various methods including bayesian networks, neural association rule mining to improve its performance on networks, state predictors, hidden markov models, occupancy data, including modifying the metric used to context predictors, eigenbehaviours. Most of the rate patterns to take account of the concept of patterns existing approaches predict each occupant’s next which apply to specific days and/or specific occupants location using only the sequence of their most recent rather than to the entire dataset, and incorporating the locations. One exception is the bayesian network temporal relationship of the timeslots by specifically approach which uses the current day, time and location searching for patterns which relate to small sets of to separately predict the next location and duration of consecutive timeslots. stay in the next location. We also consider association rules which refer only to sequences of locations, without the relationship to 4. Research Question specific timeslots, in order to find patterns of locations We propose to use association rule mining to find that do not repeat at a specific time. These rules can be useful patterns in the occupants’ historical movements. used in conjunction with the timeslot-specific rules to We believe that in many cases occupant movements achieve the best prediction accuracy. relate to factors such as time of day, day of week, time of year, as well as location earlier on the same day. By 7. 
Evaluation applying association rule mining to the historical data We evaluate our approach on an external dataset we can find patterns related to any factors of this type, which has previously been used to evaluate various including further examples such as week of the month, approaches to occupant location prediction, and on a weather, semester, etc. We can also incorporate any dataset collected internally. We compare our timetable data which is available for the occupants, for approaches to a variety of existing approaches on both example meeting schedules. datasets. We find that our approach can match or We wish to determine whether the rules yielded by exceed the accuracy of the existing approaches on next- applying association rule mining to historical occupant timeslot predictions. In addition our approach can make movements can predict occupants’ future locations with next day predictions, where we achieve a minimum the same or greater accuracy than existing approaches. 86% accuracy on occupants’ locations at or before We also wish to determine whether this approach can 9.30am, and an average of 75% accuracy over the entire find extra patterns not found by existing approaches. day.


Periodicity Detection in Lifelog Data Feiyan Hu Alan F. Smeaton Eamonn Newman Dublin City University Dublin City University Dublin City University [email protected] [email protected] [email protected]

Abstract What methods can be used to give feedback of detected Lifelogging technology is getting attention from industry, periodicity to user? academic and market, such as wearable sensors and the concept of smart home. We are proposing a framework 5. Hypothesis which could handle those aggregated multimodal and lon- Can we not only detect events, but also identify and gitudinal data. The system will take advantage of the rich mine periodicity patterns with considering time correlation information carried chronologically and implement process within one modality and/or cross-correlation between dif- such as data cleaning, low and high level patterns detection ferent modalities and finally make good effect to users. and giving feedback to users.

1. Motivation 6. Proposed Solution Multinational companies launched platform that collect- Time series analysis techniques, such as autoregression ing data from wearable sensors and centralizing the man- will be used to model each single modality and/or multi- agement of those collected data. HealthKit, HealthVault ple modalities. Machine learning methods will be applied and Google Fit coming in succession, revealing a trend that to model the correlation among different modalities. We dealing with multimodal, longitudinal data. Challenges we will detect the periodicity of time series using a combina- are facing is to mine hidden patterns in collected data, espe- tion of correlograms and periodograms, using various sig- cially reoccurring patterns, and make it useful for users. nal processing algorithms. We will use graph-theoretic ap- proaches such as the horizontal visibility algorithm to de- tect low level periodicity in time series. Automatic feature 2. Problem Statement extraction methods like deep learning will be used to map Events an/or activities people doing generate data that low-level data to high-level semantic labels. The resultant could be captured by sensors. Uniqueness of when and stream of symbols will be analysed for partial and full peri- how people doing those events and/or activities indicates odicities using ”segment and symbol periodicity detection”. person’s lifestyle. Lifelogging researchers are able to seg- ment events/episodes, detect concepts in lifelog data and model probability distribution of the concepts. But there 7. Evaluation is much more information contained in the chronological The quality of the analysis will be evaluated using quan- relationship within lifelog data, such features as periodicity titative techniques on annotated data, and also qualitatively of lifestyle in daily routine. by presenting results to users for feedback. To evaluate re- sults quantitatively we will calculate the accuracy of pre- 3. Related Work dicted missing data gap. And compare annotated period- Challenges of effective structuring, searching and brows- icity with detected periodicity in real data.To evaluate pe- ing of this image collection for locating important or sig- riodicity detection qualitatively, we need user involvement. nificant events in a person’s life has been addressed and a Interviews would be a good way to confirm detected peri- media process which could 1) capture and upload images 2) odicity with users. Cross-user comparison can be used to post processing images 3) accessing images, has been de- find out common patterns in user interviews. scribed in [2]. Based on previous researches, [1] raised a framework which enable effective memory retrieval facili- References tate with reminiscence therapy. [1] A. R. Doherty. Providing effective memory retrieval cues through automatic structuring and augmentation of a lifelog of images. PhD thesis, Dublin City University, 2009. 4. Research Question [2] H. Lee, A. F. Smeaton, N. E. O’Connor, G. Jones, M. Blighe, In order to implement a system that is able to mining D. Byrne, A. Doherty, and C. Gurrin. Constructing a sense- reoccurring patterns in aggregated multimodal longitudinal cam visual diary as a media process. Multimedia Systems, data, the following research questions can be addressed: 14(6):341–349, 2008. What Data cleaning techniques can we use to fill the missing gaps? How can we detect periodicity in low-level and high- level data series? 
What methods can we use to transform low-level data to high-level data?
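A small sketch of combining a periodogram and a correlogram to find a dominant period in a single-modality time series, as proposed above. The synthetic signal and the way the spectral peak is picked are illustrative assumptions.

```python
import numpy as np

def dominant_period(x):
    """Return the candidate period (in samples) with the strongest spectral peak,
    together with the autocorrelation at that lag as a consistency check."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    spectrum = np.abs(np.fft.rfft(x)) ** 2          # periodogram
    freqs = np.fft.rfftfreq(len(x), d=1.0)
    peak = np.argmax(spectrum[1:]) + 1              # skip the zero-frequency bin
    period = 1.0 / freqs[peak]
    lag = int(round(period))
    acf = np.corrcoef(x[:-lag], x[lag:])[0, 1]      # correlogram value at the lag
    return period, acf

# Synthetic daily step-count-like signal with a weekly (period-7) pattern.
days = np.arange(280)
signal = 8000 + 2000 * np.sin(2 * np.pi * days / 7) + np.random.randn(280) * 300
print(dominant_period(signal))
```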


Proactive Workload Consolidation for Reducing Energy Costs Milan De Cauwer University College Cork, Cork [email protected]

Abstract 4. Hypothesis Datacentre energy requirements have grown massively in We leverage a semi-online optimisation method in which the last few years. One of the optimisation challenges for perfect information is considered on a few time periods reducing its energy requirements is to keep servers well ahead. Considering only a small window allow us to evalu- utilised by deciding which Virtual Machines (VMs) to mi- ate how much of the future one must predict in a real world grate, where and when to migrate, and, when and which scenario. Our hypothesis is that one only needs to forecast servers to switch on/off. Achieving this optimally requires over a short time span to compute feasible approximations the capability of predicting the future demands accurately to an optimal solution. and computing the plan for migrating VMs. We call this the Proactive Workload Consolidation Problem (PWCP). 5. Evaluation As an illustration of the trade-off between energy cost 1. Motivation and quality of service, on Figure 1(a) and one can see val- Optimising the PWCP as an offline problem with infinite ues of the aggregation of energy and transition cost (EC + time windows is impossible both for forecasting demands TC) and the migration cost function of the number of time and optimal assignments of VMs to servers. Our experi- periods considered at once . As the number of time periods ments aim to study how much of the future one should know w rises, we observe that the solutions are showing dramatic in order to retain a solution to the problem under different improvements in terms of electricity cost. This is due to scenarios of the problem. the opportunity to build solutions in which VMs are more easily moved from a server to an other one. On the down- 2. Problem Statement side, if the solution admits too many migrations users may PWCP aims to assign a server to each VM consider- experience degraded quality of service. We also study the ing bot current and future demands. The objective is to minimise the energy cost over a time horizon without vi- olating Service Level Agreements (SLAs). The problem at hand falls into the family of workload consolidation prob- lems [2, 1].

3. Research Questions The question is how far one is required to look ahead in terms of the number time-periods and still retain a minimal energy cost of a given horizon without violating the SLAs. impact of iterative solving on the feasibility of the problem under various settings of the couple (w, N). Figure 1(b) 3.1. Leveraging prediction methods suggests that we violate the migration constraint up to 4% Predictive methods are powerful while forecasting over of the time restricted models. These figures are respectively short time periods. Is PWCP a problem for which forecast- reached for small values of both parameters. For problems ing in a small time window ( small w) is enough to approx- less tightly constrained or more future aware, the risk to vi- imate feasible and optimal solutions ? olate the migration constraint tends to zero, guaranteeing a proper QoS level. 3.2. Studying on-line problems PWCP is an on-line problem, we thus consider that we do not have perfect information about the future VM re- References quirements. Our goal is to empirically study how much [1] A. Beloglazov. Energy-efficient management of virtual ma- chines in data centers for cloud computing. 2013. of the future one is required to forecast to solve and ap- [2] N. Bobroff, A. Kochut, and K. Beaty. Dynamic placement proximate PCPW with iterative solving. A strong feature of of virtual machines for managing sla violations. In Integrated PCPW is the migration constraint that constrains how much Network Management, 2007. IM’07. 10th IFIP/IEEE Interna- the solution is allowed to change between any two consecu- tional Symposium on, pages 119–128. IEEE, 2007. tive time periods. This is called the tightness of the problem. Can we quantify the impact of the problem’s tightness (N) on the feasibility and quality of the solutions ?


Probabilistic analysis of latent space models for social networks 1,2, 1,2 2,3 Riccardo Rastelli ∗, Nial Friel and Adrian Raftery 1Insight Centre for Data Analytics 2School of Mathematical Sciences, University College Dublin, Ireland 3Department of Statistics, University of Washington, Seattle, USA ∗[email protected]

Abstract
Models that make use of a geometrical framework to explain the structure of complex networks were first introduced many decades ago. They have been studied in a mathematical approach under the name of random geometric graphs and random connection models [3]. However, estimation and inference on such models have been considered only in recent times, awakening a keen interest of the statistical community. In this work we provide a probabilistic analysis of the statistical properties of a particular class of random connection models: the latent space models. A framework based on probability generating functions, similar to that used in complex networks, is adopted to derive analytical results concerning transitivity, degree distributions and connectivity.

1. Motivation
In recent years, modeling of networks has begun to grow rapidly, drawing the interest of many researchers with different backgrounds. Data representing social relations can be collected in the form of a graph and studied through statistical models, which have gone beyond applications to social sciences, proving to be appropriate in different fields such as biological sciences, epidemiology and computer sciences. The prohibitive amount and the random nature of collected data makes network analysis a tough task. Complex networks provide models with a very simple structure that allow a probabilistic study of their asymptotic properties, but they present relevant shortcomings when used to represent real data. Conversely, statistical models are typically more suited to the problems of plausibility and estimation, leading to a more complicated structure that leaves no place for a study of their statistical properties. In this work, we use distinctive ideas and methods of complex networks, originally introduced in [2], to study a well-known class of statistical models: latent space models.

2. Problem Statement
Introduced in [1], latent space models have drawn much attention due to their inherent ability to easily represent social networks' peculiarities, such as transitivity, homophily, clustering, scale-free and small-world behaviours. However, this advantage has not been studied and assessed in a more formal and objective approach. Thus, the family of networks represented by latent space models is not known exactly. It is important indeed to understand to what degree the network's attributes will be explained.

3. Model assumptions
In a latent space model, unobserved coordinates in the Euclidean space R^d are associated to each actor. Then, the probability that two actors are tied by an edge depends on the distance between the two corresponding points:

p_{ij} = \tau \exp\left( -\frac{1}{2\varphi} (z_i - z_j)'(z_i - z_j) \right)

where the z's are the latent positions and \varphi and \tau are model parameters. The points are distributed in the latent space according to a sequence of IID multivariate Gaussian distributions centered in the origin and with diagonal covariance matrix \gamma I.

4. Solution
The probability generating function for the degree of an arbitrary actor is given by:

G(x) = \int_{\mathbb{R}^d} \phi(z_s; 0, \gamma)\, \left[ x\,\theta(z_s) + 1 - \theta(z_s) \right]^{n} dz_s,

where \phi(\cdot; 0, \gamma) is the d-dimensional normal density centered in the origin and with covariance matrix \gamma I_d; and:

\theta(z_s) = \int_{\mathbb{R}^d} \phi(z_j; 0, \gamma)\, p_{sj}\, dz_j.

From these equations all the results can be derived.

5. Results proposed
The clustering coefficient C as well as the degree probability p_k are written explicitly for the model proposed. In contrast with non-structured complex network models, the clustering coefficient shows an asymptotic non-zero limit, highlighting the clustering nature of latent space models. The analytical expression for the degree distribution instead allows us to assess whether heavy-tailed distributions can be represented.

References
[1] P. D. Hoff, A. E. Raftery, and M. S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.
[2] M. E. Newman, S. H. Strogatz, and D. J. Watts. Random graphs with arbitrary degree distributions and their applications. Physical Review E, 64(2):026118, 2001.
[3] M. Penrose. Random geometric graphs, volume 5. Oxford University Press, Oxford, 2003.


Quality on the Web of Data Emir Muñoz Fujitsu Ireland Limited, National University of Ireland, Galway E-mail: [email protected]

Abstract 4. Research Question The proliferation of data on the Web has opened many op- From the analysis of the state-of-the-art in Linked Data portunities and challenges, but also raised concern about Quality, here we propose the following research questions its quality. With many users involved in the Web of Data, that will guide our work: 1) What dimensions can be used unintentional and intentional errors are possible. This work to measure the quality of LD datasets?; 2) Can we train a aims to design methods and metrics that help users to mea- spam detector for URIs and text in LD?; and 3) Can external sure the quality of Linked Data (LD) datasets. knowledge be used to detect erroneous data in LD?

1. Motivation 5. Hypothesis Nowadays, data is considered as one of the main assets The hypotheses considered in this work are: 1) RDF by companies, government agencies and users in general. statements appearing in different external knowledge bases Everyone agree that the rapid growth of the Web of Data can be considered trustworthy; 2) atypical values in RDF brings plenty of opportunities and challenges to face, that properties can be detected using statistical analysis; and can be used to increase their profits. This new commercial 3) the textual values for a given property follow a limited interest on Linked Data also raised concern about the data and small set of (structural) patterns. quality. For a user to use the Web of Data, she requires a 6. Proposed Solution way to ensure that no internal data-rule or business-rule of While previous research focused on quality at dataset her process will be violated. Data is high quality ”if they level. We propose a fine-grained approach at RDF state- are fit for their intended uses in operations, decision mak- ment level that includes: 1) to use statistical tools to define ing and planning” as said Roebuck [3]. For instance, the the most relevant dimensions to consider in a quality analy- following set of RDF statements (facts), sis; 2) pattern recognition to determine the presence of out- _:m a bibo:Quote . liers in data; and 3) to train classifiers to detect spam RDF §_:m bibo:content "Buy WATCHES on replicaking.com" . ¤ statements, based on: SMS spam, and URL spam corpora. _:m dc:creator "Tim Berners-Lee" . This allow us to weigh LD content in a fine-grained manner, state¦ that Tim Berners-Lee, the creator of the World Wide¥ giving us insights about its quality and trustworthiness. Web, is author of a quote that suggests to buy watches on- line. It is highly probable that he did not say that, and this 7. Discussion data was published by the owner of replicaking.com There is no real solution to detect spams in LD yet. We website. This is considered as misattribution or spam data. will use the corpus published in [1] to build and validate our approach on detecting spam. Preliminary results on DBpe- 2. Problem Statement dia showed the existence of patterns in LD that allow the detection of lexical errors and outliers. To detect further Publishers of data are prone to publish misleading in- errors we are planning to use search engines as external formation as LD. Further, some users intentionally publish knowledge bases to find evidences about accuracy of RDF wrong data with the goal of generating a financial gain. statements (a.k.a. textual entailment). Then, methods and metrics for measuring quality of the Web of Data are crucial to gain trust and confidence. 8. Conclusions 3. Related Work In this manuscript we described an ongoing work, were Quality analysis in Linked Data is a recent topic. A we present a proposal to analyse the quality of LD datasets previous work [2] analyses 12M RDF statements crawled in terms of entailment (i.e. evidence) and spamming. from over 150K URLs identifying different sources of prob- lems: incompleteness, incoherency, hijacking and inconsis- References tency. The identified problems, such as syntax errors, noise [1] A. Hasnain, M. Al-Bakri, L. Costabello, Z. Cong, I. Davis, and inconsistency can be considered as ‘accidental’ errors and T. Heat. Spamming in linked data. In COLD, 2012. [2] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. 
while others, such as non-authoritative contributions over Weaving the pedantic web. In LDOW, 2010. part (or full) vocabularies can be considered as ‘premed- [3] K. Roebuck. Data Quality: High-Impact Strategies - What itated’. In [1] authors introduce the prospect of “Linked You Need to Know: Definitions, Adoptions, Impact, Benefits, Data spam” where misleading information is deliberately Maturity, Vendors. Emereo Pty Limited, 2011. published with financial purposes.


Random Manhattan Indexing

A Randomized Scalable Method for Semantic Similarity Measurement in L1 Normed Spaces Behrang Q. Zadeh Insight Centre for Data Analytics National University of Ireland, Galway [email protected]

Abstract
Vector space models are a well-defined mathematical representation framework that has been widely used in text analytics. In order to deliver a solution for problems that require a minimal level of text understanding, in these models text units are represented by high-dimensional vectors. The constructed vector spaces are endowed with a norm structure, and a distance formula is employed to compute the similarity of vectors, and thus the similarity of the text units that they represent. The high dimensionality of the vectors, however, is a barrier to the performance of these models. We introduce Random Manhattan Indexing (RMI) for the construction of L1 normed vector space models of semantics at reduced dimension. RMI is a two-step incremental method of vector space construction that employs a sparse stable random projection to achieve its objective.

1. Motivation
Distributional approaches to semantics tie the meaning of text units to their usage context. These methods attempt to quantify the meaning of text units by investigating their distributional similarities. A vector space is an algebraic structure that can be employed to represent such distributional similarities. In this model, the relative proximity of vectors to one another interprets the meaning of the text units that they represent. However, as the number of text units that are being modelled in a VSM increases, the number of contexts that are required to capture their meaning escalates. This phenomenon is explained using power-law distributions of text units in contexts. For example, Zipf's law states that most words are rare, while few words are used frequently. As a result, extremely high-dimensional vectors, which are also sparse, represent text units. The high dimensionality of the vectors results in obstacles, which are known as the curse of dimensionality. A dimension reduction method is often required to alleviate these problems.

2. Random Manhattan Indexing
We employ stable random projections and introduce the RMI technique for the incremental construction of vector spaces in L1 normed spaces. In this method, the dimension of the vector space is fixed, independent of the text data, and known prior to the task of vector space construction. The method, thus, is an excellent choice for processing big text data at large scale, such as the web.
RMI employs a two-step procedure. First, context elements are assigned to index vectors. Index vectors are unique and generated randomly such that entries r_i of index vectors have the following distribution:

r_i = \begin{cases} -\frac{1}{U_1} & \text{with probability } \frac{s}{2} \\ 0 & \text{with probability } 1 - s \\ \frac{1}{U_2} & \text{with probability } \frac{s}{2} \end{cases}   (1)

where U_1 and U_2 are independent uniform random variables in (0,1). In the second step, each text unit is assigned to a context vector \vec{v}_c, where initially all the elements of \vec{v}_c are set to 0. For each encountered co-occurrence of a text unit and a context element, the context vector of the text unit is accumulated by the index vector \vec{r}_i of the context element, i.e. \vec{v}_c = \vec{v}_c + \vec{r}_i. The result is a VSM at reduced dimension that can be used to estimate the pairwise L1 distance between text units in the model. In the constructed vector space, the logarithmic geometric mean can be used to estimate the L1 distance between vectors:

\hat{L}_1(\vec{u}, \vec{v}) = \exp\left( \frac{1}{m} \sum_{i=1}^{m} \ln |u_i - v_i| \right)   (2)

The proposed method can be verified mathematically and has been validated by a set of experiments [2]. In addition, a computationally enhanced variation of the RMI method can be found in [1].

Acknowledgment
This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant Number SFI/12/RC/2289.

References
[1] B. Q. Zadeh and S. Handschuh. Random Manhattan indexing. In Proceedings of the 25th International Workshop on Database and Expert Systems Applications (DEXA), 2014.
[2] B. Q. Zadeh and S. Handschuh. Random Manhattan integer indexing: Incremental L1 normed vector space construction. In Empirical Methods in Natural Language Processing, 2014.
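A direct, small-scale sketch of equations (1) and (2): generate sparse random index vectors, accumulate context vectors incrementally, and estimate the pairwise L1 distance with the logarithmic geometric mean. The dimensions, the sparsity value and the toy co-occurrence stream are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def index_vector(m=100, s=0.02):
    """Sparse random index vector with entries -1/U1, 0, or 1/U2 (Eq. 1)."""
    r = np.zeros(m)
    mask = rng.uniform(size=m) < s          # non-zero with probability s
    signs = rng.uniform(size=m) < 0.5       # split s evenly between the two cases
    u = rng.uniform(size=m)
    r[mask & signs] = -1.0 / u[mask & signs]
    r[mask & ~signs] = 1.0 / u[mask & ~signs]
    return r

def l1_estimate(u, v):
    """Logarithmic geometric mean estimator of the L1 distance (Eq. 2)."""
    diff = np.abs(u - v)
    diff = diff[diff > 0]                   # avoid log(0) in this sketch
    return np.exp(np.mean(np.log(diff)))

# Incremental construction: one context vector per text unit, accumulated
# with the index vector of every co-occurring context element.
contexts = {w: index_vector() for w in ["data", "vector", "space", "semantics"]}
units = {"doc1": np.zeros(100), "doc2": np.zeros(100)}
for unit, word in [("doc1", "data"), ("doc1", "vector"), ("doc2", "space")]:
    units[unit] += contexts[word]
print(l1_estimate(units["doc1"], units["doc2"]))
```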


Real-time Algorithm Configuration Tadhg Fitzgerald University College Cork tadhg.fi[email protected]

Abstract
We present a novel approach for real-time algorithm configuration while processing a stream of instances by using multi-core CPUs to evaluate parameters in parallel.

1. Introduction
Search algorithms have a large number of parameters controlling various aspects of their behaviour. Finding the correct parameter settings can often improve an algorithm's performance, even by an order of magnitude [7, 5, 9]. Selecting good settings is known as algorithm configuration. Traditionally, algorithms were configured manually; however, this was time-consuming, error-prone and required expert knowledge. For this reason recent research has focused on automatic algorithm configuration [2, 3, 6, 1, 4]. Generally, automatic algorithm configuration is achieved by training offline on a set of representative instances. There have been a number of approaches taken to this problem. F-Race [2] and its successor, Iterated F-Race [3], race a small set of parameters against each other, removing those that are under-performing. ParamILS [6] uses a focused iterated local search in order to find good parameter settings. GGA [1] uses a genetic algorithm approach to finding good parameter configurations. Sequential Model-based Algorithm Configuration (SMAC) [4] generates a model to predict the likely performance if certain parameters are used.

2. Real-time Algorithm Configuration
Online algorithm configuration is the task of learning new parameters while processing a stream of incoming problem instances (e.g. vehicle routing problems). There is no offline learning step, and the parameters used change if the types of instances encountered change. To the best of our knowledge there has been no other research in this area. Our approach to online algorithm configuration uses the power of modern multi-core CPUs to evaluate multiple solvers on the same instance using different parameter configurations. The parameterisation which delivers a result in the fastest time (or, in the case that all time out, the parameterisation with the highest objective value) is considered the winner. By tracking the parameter settings which win often, we can decide when one parameter configuration is dominating another. Weaker settings are then replaced with a new set chosen at random. By including the best known parameter set among the starting set of parameters, we can guarantee a certain level of performance.

3. Evaluation
Our approach was tested on a set of 2000 combinatorial auction instances (a type of auction where participants bid on combinations of items [8]). The instances were ordered by the number of goods, and those taking less than 30 seconds or more than 900 seconds using default CPLEX parameters were removed. Two versions of our system were evaluated: Online-Cold, where we initialise with 6 random parameterisations, and Online-Warm, where we start with 5 random parameterisations and the CPLEX defaults. SMAC is used for comparison. 200 instances are solved by the CPLEX default parameters while SMAC trains, then the remainder are solved using the parameters which SMAC finds. We train multiple versions of SMAC and present both the average performance (SMAC-Avg) and the virtual best performance (SMAC-VB).

Figure 1: The x-axis specifies the number of instances observed, the y-axis specifies the cumulative average runtime in seconds.

Figure 1 shows the cumulative average solving time for all methods (500s timeout). Notice that the warm-started version is improving from the very beginning, while the cold-started version is able to overtake the CPLEX defaults after only 400 instances. When the evaluation has finished, the average time of a solver using our approach is half that using the CPLEX defaults. The performance of our approach almost matches that of SMAC, despite the fact that our approach required no pre-training while SMAC spent 48 hours training offline.

References
[1] C. Ansotegui, M. Sellmann, and K. Tierney. A gender-based genetic algorithm for the automatic configuration of algorithms. In Proceedings of CP, pp. 142-157, 2009.
[2] M. Birattari. A racing algorithm for configuring metaheuristics. In Proceedings of GECCO, pp. 11-18, 2002.
[3] M. Birattari, Z. Yuan, P. Balaprakash, and T. Stützle. F-Race and Iterated F-Race: An Overview. In Experimental Methods for the Analysis of Optimization Algorithms, pp. 311-336, 2010.
[4] F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of LION, pp. 507-523, 2011.
[5] F. Hutter, H. Hoos, and K. Leyton-Brown. Parallel algorithm configuration. In Proceedings of LION, pp. 55-70, 2012.
[6] F. Hutter, H. Hoos, K. Leyton-Brown, and T. Stützle. ParamILS: An automatic algorithm configuration framework. JAIR, 36:267-306, 2009.
[7] S. Kadioglu, Y. Malitsky, M. Sellmann, and K. Tierney. ISAC: Instance-Specific Algorithm Configuration. In Proceedings of ECAI, pp. 751-756, 2010.
[8] K. Leyton-Brown, M. Pearson, and Y. Shoham. Towards a universal test suite for combinatorial auction algorithms. In Proceedings of ACM-EC, pp. 66-76, 2000.
[9] Y. Malitsky, D. Mehta, B. O'Sullivan, and H. Simonis. Tuning parameters of large neighborhood search for the machine reassignment problem. In Proceedings of CPAIOR, pp. 176-192, 2013.
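Editor's sketch of the racing loop described in Section 2, not taken from the author's implementation: the solve() wrapper, the sample_random_parameters() helper and the CPLEX-style parameter names are placeholders, and the dominance threshold is illustrative.

import random
from concurrent.futures import ProcessPoolExecutor, as_completed

def sample_random_parameters():
    # Hypothetical helper: draw a random parameterisation from the solver's space.
    return {"mip_emphasis": random.choice([0, 1, 2]),
            "heuristic_freq": random.randint(-1, 100)}

def solve(instance, params, timeout=500):
    # Placeholder for running the solver (e.g. CPLEX) on one instance with `params`.
    # Expected to return (runtime_in_seconds, objective_value).
    raise NotImplementedError

def race(instance, configs, timeout=500):
    # Evaluate every parameterisation on the same instance in parallel, one core each.
    results = {}
    with ProcessPoolExecutor(max_workers=len(configs)) as pool:
        futures = {pool.submit(solve, instance, c, timeout): i for i, c in enumerate(configs)}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    finished = {i: r for i, r in results.items() if r[0] < timeout}
    if finished:                                        # fastest finisher wins
        return min(finished, key=lambda i: finished[i][0])
    return max(results, key=lambda i: results[i][1])    # all timed out: best objective wins

def online_configuration(instance_stream, configs, dominance_gap=10):
    wins = [0] * len(configs)
    for instance in instance_stream:
        winner = race(instance, configs)
        wins[winner] += 1
        for i in range(len(configs)):                   # replace clearly dominated settings
            if i != winner and wins[winner] - wins[i] >= dominance_gap:
                configs[i], wins[i] = sample_random_parameters(), 0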


Realtime Keyword Specific Mining of Social Media Daniel Merrick University College Cork [email protected]

Abstract
Online social media streams such as Twitter have become an increasingly popular method for users to communicate about a wide range of topics and track worldwide events. While Twitter has become an excellent resource, it is hard not to be overwhelmed by the redundancy in the ever-growing stream of tweets. This paper describes a methodology which, with minimal human guidance, ranks tweets' relevance to a user-input query using an array of text mining and machine learning techniques, and which can be used to create a keyword-specific timeline.

1. Problem Statement
The aim of this research is to develop a methodology which effectively and efficiently classifies and retrieves tweets from a live stream or database for a given keyword. For example, we may be interested in retrieving posts related to a certain football team. The method should be able to return, for an input tweet, its probability of being related to the keyword. The methodology presented in this paper was developed on a corpus of tweets and can be trivially applied to a live stream.

2. Related Work
A number of unsupervised approaches for event discovery in social media feeds have been presented in recent publications. [1] compares six methods for event detection, which include 'document pivot topic detection'. This process compares the cosine of the Term Frequency-Inverse Document Frequency (tf-idf) of a tweet to all previously processed tweets and clusters based on a threshold. The use of n-grams (n consecutive words) is proposed to improve the quality of these methods. [2] shows the use of 3- or 4-grams achieving 3 times better performance when compared to uni-grams. My research employs similar approaches in a supervised problem domain, where the end user will have a specific event which they wish to track.

3. Data Collection
The tweepy Python package was used to collect data from the Twitter API, which contains information about each tweet, including the publisher and the text body of the tweet. Four million tweets were collected over a 4-day period using 21 sport buzz words (such as 'champions') as a search query. Multiple queries were chosen to allow the model to exploit redundancy in the data. The data was then tokenized as required by the respective steps in the method; hashtags and user ids were also parsed.

4. Method
There are five approaches in the methodology to achieve the previously outlined goals. For all the following methods, the employed frequency scoring algorithm was inspired by the classical tf-idf metric. (1) For each uni-gram in a given tweet's text, a frequency score relative to the keyword is calculated. (2) Similarly, a frequency score relative to the keyword is calculated for each bi-gram. All uni-grams and bi-grams have stop words removed using a filtering process, which involves scraping 250 random Wikipedia pages and collecting the most commonly occurring words as stop words. (3) The hash-tag entities for each tweet are ranked based on a relevance frequency score. (4) The given tweet's publisher id is used to test if the user has previously tweeted about the given keyword. (5) Finally, a random forest, composed of 100 trees, built upon the occurrence of the 150 most frequently used words for the given keyword and a large set of similarly chosen words in a similar domain, is used to predict if the input tweet is target or noise data.

5. Results and Discussion
Results of tests on the efficiency of the features can be seen in Table 1. The table shows no overlap in users between the training and test data sets, so this feature can be omitted. The table also supports the argument that the use of bi-grams is more effective than uni-grams. The final element of this project combines these methods in an ensemble model which returns the probability of the input tweet being related to the given topic, which can then be used to gather and output relevant posts to a custom timeline.

Table 1: Accuracy of methods (%).
                                   Monograms  Bigrams  Hashtags  Userid  Random Forest
Predicted correctly                 93.99      96.46    72.38     0       95.75
Incorrectly predicted Target         2.91       1.77     0.9      0        2.6
Incorrectly predicted not Target     3.1        1.77    26.72     0        1.65

References
[1] L. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Göker, I. Kompatsiaris, and A. Jaimes. Sensing trending topics in Twitter. IEEE Transactions on Multimedia, 2013.
[2] C. Martin, D. Corney, and A. Göker. Finding newsworthy topics on Twitter. IEEE Computer Society Special Technical Community on Social Networking E-Letter, 2013.
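As a rough illustration of step (5) above, the following sketch builds binary word-occurrence features and a 100-tree random forest with scikit-learn; the tiny tweet lists, the stop-word set and the query text are invented stand-ins for the collected corpus, not the author's data.

from collections import Counter
from sklearn.ensemble import RandomForestClassifier

stop_words = {"the", "a", "is", "and"}                       # illustrative only
target_tweets = ["the champions league final is tonight", "a great win for the champions"]
noise_tweets = ["the new phone is out", "a quiet day and nothing happened"]

def top_terms(tweets, k=150):
    # Most frequent non-stop-word uni-grams in a set of tweets.
    counts = Counter(w for t in tweets for w in t.lower().split() if w not in stop_words)
    return [w for w, _ in counts.most_common(k)]

def occurrence_vector(text, vocabulary):
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

vocabulary = top_terms(target_tweets) + top_terms(noise_tweets)
X = [occurrence_vector(t, vocabulary) for t in target_tweets + noise_tweets]
y = [1] * len(target_tweets) + [0] * len(noise_tweets)       # 1 = target, 0 = noise

forest = RandomForestClassifier(n_estimators=100)            # 100 trees, as in step (5)
forest.fit(X, y)
prob_related = forest.predict_proba([occurrence_vector("champions win tonight", vocabulary)])[0][1]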


Real-Time Predictive Analytics using Open Data and the Web of Things Wassim Derguech Eanna Burke Edward Curry National University of Ireland, Galway - E-mail: {firstname.lastname}@insight-centre.org

Abstract
This paper discusses real-time predictive data analytics using open data for decision support. Energy management use case: predicting energy usage using weather data.

1. Motivation
Eurostat [1] estimates that the amount of energy consumed by buildings in the European Union reaches 40-45% of the total energy consumption; as such it must be addressed as part of an overall energy reduction in a wider economy. Thus, understanding and predicting energy use helps effective decision making towards a reduced energy consumption.

2. Problem Statement
The aim of this work is to use open weather data accessible from free APIs provided by various online sources, in combination with building electricity use data from local sensors, to predict future electricity use in an accurate and reliable manner. When working with data taken from sensors connected to the Internet and other Web sources, the data is from a third party and its quality and reliability are outside of our control. This leads to the problem of having to pick data sources from the Web that best suit our needs. In addition, this must be done in such a way that the system is reliable and continues making the best possible predictions over time, so that prediction consumers can depend on the quality of the predictions made.

3. Related Work
In the same context, the literature proposes various solutions [3, 2] for predicting energy use using weather data. The main problem with current solutions is that they require a large amount of historical data from a single data source. We focus instead on providing an autonomous system that produces accurate results rapidly, even without historical data. Additionally, we propose in our work a methodology that uses multiple sources of weather data.

4. Research Question
This problem relates to two research questions: (1) How can we handle big open data: collection, filtering and warehousing? and (2) How to effectively select the right data source for effective analytics: prediction?

5. Proposed Solution
We propose a two-step solution: (1) open data management: collection, filtering and warehousing (Fig. 1), and (2) big data analytics: source selection and prediction (Fig. 2).

Figure 1: Weather and Sensor Data Management (web crawlers and data collectors feed weather forecast and observation sources plus energy sensor readings through RDF generators and converters into RDF storage and a triple store).

Figure 2: Source Selection and Predictive Analytics (an error comparison module, an error database, a source selector and a re-selection controller exchange error percentages, re-selection flags and SelectionResult objects, with predictions and errors displayed in a user interface).

6. Evaluation
The system has been developed and deployed within the DERI building. Fig. 3 shows its user interface, also depicting the error rates generated. The results show that a local weather station was more reliable than other resources.

Figure 3: The User Interface of the Developed System.

References
[1] Energy: Yearly Statistics 2005. Office for Official Publications of the European Communities, 2007.
[2] Y. Penya, C. Borges, D. Agote, and I. Fernandez. Short-term load forecasting in air-conditioned non-residential buildings. In IEEE ISIE, 2011.
[3] V. M. Zavala, E. M. Constantinescu, T. Krause, and M. Anitescu. On-line economic optimization of energy systems using weather forecast information. J. Process Control, 2009.
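A toy version of the error-comparison and re-selection loop sketched in Fig. 2 might look as follows; the class, its method names and the mean-absolute-percentage-error criterion are editorial assumptions, not the deployed system's API.

from collections import defaultdict

class SourceSelector:
    """Keeps a running prediction error per weather source and, when asked to
    re-select, switches to the source whose predictions have been most accurate."""

    def __init__(self, sources):
        self.errors = defaultdict(list)
        self.current = sources[0]

    def record(self, source, predicted_kwh, actual_kwh):
        # Store the absolute percentage error of one prediction made with `source`.
        self.errors[source].append(abs(predicted_kwh - actual_kwh) / actual_kwh)

    def reselect(self):
        mean_err = {s: sum(e) / len(e) for s, e in self.errors.items() if e}
        if mean_err:
            self.current = min(mean_err, key=mean_err.get)
        return self.current

selector = SourceSelector(["local_station", "forecast_api"])
selector.record("local_station", predicted_kwh=118.0, actual_kwh=120.0)
selector.record("forecast_api", predicted_kwh=140.0, actual_kwh=120.0)
best = selector.reselect()       # "local_station" in this invented example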


Reversible photo-actuated hydrogels for micro-valve applications Aishling Dunne, Larisa Florea*, Dermot Diamond Insight Centre for Data Analytics, National Centre for Sensor Research, School of Chemical Sciences, Dublin City University, Dublin 9, Ireland [email protected], [email protected], [email protected]

Abstract
In recent years, a popular way of photo-modulating flow control in microfluidic channels has been through the use of acidified spiropyran (SP) hydrogels that needed to be externally protonated with HCl solutions [1,2]. In the swollen protonated merocyanine (MC-H+) form, the hydrogel blocks the channels and prevents flow. When exposed to white light, the positively charged MC-H+ is converted to the uncharged SP form, triggering shrinking of the hydrogel, and the channel opens. The addition of acrylic acid copolymerised within the hydrogel provides an internal source of protons that allows repeatable photo-actuation in neutral pH environments. Here we report the effect of the polymerization solvent on the shrinking and swelling kinetics of the photo-responsive hydrogel. Using this approach, reversible fast photo-actuated hydrogels have been obtained and have been successfully used for micro-valve applications in microfluidic channels.

1. Introduction
Previously, photo-control of flow in microfluidics has been performed using acidified hydrogel networks. Despite the attractiveness of photo-actuated polymer valves for microfluidics, this approach had significant disadvantages, such as the requirement of strong external acidic solutions to induce hydrogel re-swelling and slow re-swelling times of up to several hours. These disadvantages have restricted the use of photo-actuated hydrogels to single-use applications. More recently, acrylic acid (AA) has been incorporated as an internal source of protons, removing the requirement of external acidic environments [3]. In water, the acrylic acid comonomer dissociates, resulting in the protonation of the photochromic spiropyran (SP) to protonated merocyanine (MC-H+). This form is hydrophilic, allowing the hydrogel to swell. Exposure to white light promotes isomerisation of the MC-H+ form to the hydrophobic SP form, which triggers contraction of the hydrogel (Figure 1).

Figure 1. Isomerisation of SP to MC-H+.

2. Results
In this study, photo-actuator hydrogels were generated using an N-isopropylacrylamide-co-acrylated spiropyran-co-acrylic acid (p(NIPAAM-co-SP-co-AA)) copolymer, in a 100-1-5 mole ratio. Different ratios of organic solvent:water were used as the polymerisation solvent. The organic solvents employed in this study were tetrahydrofuran (THF) and dioxane. The gel with the best response was then photo-polymerised in-situ inside a microfluidic channel, functioning as a micro-valve. By varying the solvent ratio in the solvent mixtures, hydrogels with different pore sizes, and therefore different extents of swelling/shrinking and actuation kinetics, were obtained. For example, when THF:water (4:1 v:v) was used as polymerization solvent, a remarkable contraction in hydrogel size of up to 50% was obtained after four minutes of white light irradiation (Figure 2).

Figure 2. Shrinking and swelling kinetics of 4:1 THF:water.

Hydrogel microstructures were photo-polymerised in-situ in PDMS/glass microfluidic channels for valve applications. The hydrogels were photo-polymerised around 550 µm diameter pillars using photo-masks. When exposed to white light, the valves contracted, opening the channel and allowing fluid flow. The opposite was seen when the valve was kept in the dark (Figure 3).

Figure 3. Micro-valves ON/OFF demonstration.

3. Conclusion
Optimization of the polymerization solvent mixture has resulted in the fabrication of hydrogels with faster and reproducible shrinking and re-swelling cycles. This study demonstrates that these hydrogels can be successfully used as photo-controlled valves in microfluidic systems for repeatable ON/OFF flow control in neutral environments. The polymerisation solvent has been shown to directly influence the morphology of the hydrogels, producing porous hydrogels of different pore sizes. This has an impact on the diffusion path length for water molecules moving in/out of the hydrogel matrix, thus improving the swelling and shrinking kinetics of the hydrogel.

4. References
1. Sumaru, K., et al., Langmuir (2006) 22 (9), 4353.
2. Taku, S., et al., Soft Matter (2011) 7.
3. Bartosz, Z., et al., Soft Matter (2013) 9.


Self-Propelled Ionic Liquid Droplets Wayne Francis, Larisa Florea* and Dermot Diamond *Insight Centre for Data Analytics, National Centre for Sensor Research, School of Chemical Sciences, Dublin City University, Dublin 9, Ireland. {wayne.francis, larisa.florea, dermot.diamond}@insight-centre.org

Abstract
Here we report self-propelled ionic liquid droplets capable of controlled movement across the air/liquid interface. These droplets were guided to specific destinations within open channels through the use of chloride gradients. The motion of these droplets stems from surface tension effects due to the triggered release of surfactant from the droplet.

1. Introduction
Stimuli-responsive materials have gained much attention recently as new means for fluid flow control within the microfluidic field. Controlling flow using conventional pumps, valves and other physical actuators can be costly and offers limited control over flow within the chip. Flow can potentially be controlled by integrating stimuli-responsive droplets into the system. This provides external control over fluid flow and allows interesting advantages, such as the potential for the droplets to act as dynamic sensing vehicles, micro-chemical reactors and micro-cargo carriers. Using stimuli-responsive surfactants, several research groups have developed smart droplets which are able to solve complex mazes [2] or that can be photo-manipulated, either guided or propelled by light [1]. In this work, novel stimuli-responsive self-propelled droplets capable of moving at the air/liquid interface are developed and characterized.

2. Results and Discussion
The micrometre-sized droplets used in this project were designed to move in an open fluidic channel and were composed of the ionic liquid (IL) trihexyl(tetradecyl)phosphonium chloride ([P6,6,6,14][Cl]). The motion of these droplets was controlled by the triggered release of the [P6,6,6,14]+ cation, a component of the IL and a very efficient cationic surfactant (Figure 1). Once released, the surfactant lowers the surface tension of the aqueous solution, thus creating an asymmetric surface tension gradient. This leads to Marangoni-like flows which drive the droplet from areas of low surface tension toward areas of high surface tension. The rate of [P6,6,6,14]+ release depends on the concentration of the chloride in the aqueous solution, as the formation of free [P6,6,6,14]+ (the active surfactant at the air-aqueous interface) through dissociation of the relatively closely associated [P6,6,6,14][Cl] ions in the IL depends on the local Cl- concentration at the IL-aqueous boundary [3]. Therefore droplets are guided to specific destinations within open channels by creating Cl- gradients for the droplet to follow. To date, the methods for generating the Cl- gradients have been relatively short-lived due to chemical equilibrium; this limits the control over the speed of the droplet and the amount of time droplets can be manipulated. Electro-generation of the gradients is proposed, as this will potentially allow for on-demand gradient generation, while also maintaining the gradients for longer periods of time.

Figure 1: Diagram showing the composition of the [P6,6,6,14][Cl] droplet and the relative solubility of the [P6,6,6,14]+ surfactant in NaOH 10^-2 M (left) and HCl 10^-2 M (right) solutions.

3. Conclusion
The use of electro-stimulus allows on-demand generation of gradients within microfluidic devices, which broadens the potential application of these self-propelled droplets. In principle, this effect could facilitate many applications involving smart materials, such as programmed drug delivery from patches through skin, smart wearable fabrics that respond autonomously to changes in the local environment (e.g. body temperature), or the realization of very low cost/low power (ideally zero power) autonomous chemical analysers capable of performing sophisticated microfluidic management using chemistry to drive the processes, rather than conventional pumps and valves.

4. References
[1] A. Diguet, R.-M. Guillermic, N. Magome, A. Saint-Jalmes, Y. Chen, K. Yoshikawa and D. Baigl, "Photomanipulation of a droplet by the chromocapillary effect", Angewandte Chemie (International ed. in English) (2009) 9281-9284.
[2] I. Lagzi, S. Soh, P. Wesson, K. Browne and B. Grzybowski, "Maze solving by chemotactic droplets", Journal of the American Chemical Society 132 (2010) 1198-1199.
[3] D. Thompson, S. Coleman, D. Diamond and R. Byrne, "Electronic structure calculations and physicochemical experiments quantify the competitive liquid ion association and probe stabilisation effects for nitrobenzospiropyran in phosphonium-based ionic liquids", Physical Chemistry Chemical Physics 13 (2011) 6156-6168.


Social Media Feeds Clustering in Heterogeneous Information Networks Narumol Prangnawarat, Hugo Hromic, Ioana Hulpus, Conor Hayes National University of Ireland, Galway {narumol.prangnawarat, hugo.hromic, ioana.hulpus, conor.hayes}@insight-centre.org

Abstract
Nowadays, social media plays an important role as a source of information. However, users are exposed to a flow of information far beyond their cognitive abilities, leading to information overload. In this context, it is crucial to research ways of automatically organising social media feeds in order to support easy access to information that users find relevant. While there are many approaches related to this problem, they mainly focus on one type of data (homogeneous), such as the text of the feeds, or the network of users, at a time. Our research focuses on combining these different data types in a heterogeneous network. We propose to group posts by different topics and events by analysing this heterogeneous network. In order to make the results of our clustering easily interpretable for users, we use semantic topic labelling to label the resulting clusters.

1. Motivation
Currently, social media has a great influence on our daily lives. People broadcast events and news using social media. Automatic methods are required to handle this great amount of socially produced information. The first problem we address is how to automatically group posts based on their topics. The second one refers to ranking posts and users with respect to their relevance to the identified topics. Previous approaches focus on clustering posts based on only one type of data, such as texts or the network of users. Our approach combines different types of data in a heterogeneous network. The analysis of this heterogeneous network would allow us to better identify topics that are being discussed or emerging. To this end, our research focuses on how we can determine main discussion topics by analysing the heterogeneous network connecting people, posts and concepts.

2. Related Work
Hromic et al. [1] proposed a methodology for filtering, grouping and ranking Twitter streams and providing tailored breaking news to end-users using solely the underlying user interaction networks. On the other hand, Lau et al. [2] used LDA (Latent Dirichlet Allocation) topic models over only the tweets' content for grouping and detection of events. In our approach, we leverage the idea of using connections between users, posts and concepts generated from the texts, all at the same time.

3. Proposed Solution
Our hypothesis is that the analysis of the heterogeneous network connecting users, posts and concepts can bring benefits to the post clustering and ranking. One of the most straightforward analyses for detecting discussion topics is to cluster posts. By making use of the heterogeneous network, we plan to use and extend the state-of-the-art RankClus [3] algorithm. RankClus integrates clustering and ranking and is tailored for networks with multiple types of nodes. Figure 1 illustrates a toy example of a heterogeneous network that contains users, tweets and concepts, and the relations between them. The two clusters on the left hand side of the figure can be detected with RankClus. Note that users can belong to more than one cluster, for instance user U4 in the example. The concepts graph on the right side can be extracted from DBpedia. It can be used in the clustering stage, but it also helps to understand and label the discussed topics within each obtained cluster.

Figure 1: A toy example of a heterogeneous network between U: users, T: tweets and C: concepts.

4. Discussion
We believe that considering the network connecting elements from more data types will improve social media post clustering beyond the current state-of-the-art. A very interesting research direction related to post clustering is event detection. We plan to apply the results of this work in order to detect the emergence of new clusters as well as changes in overall network properties that might indicate new events.

References
[1] H. Hromic, M. Karnstedt, M. Wang, A. Hogan, V. Belak, and C. Hayes. Event planning in a stream of big data. KDML, 2012.
[2] J. Lau, N. Collier, and T. Baldwin. On-line trend analysis with topic models: twitter trends detection topic model online. COLING, 2012.
[3] Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu. RankClus: Integrating clustering with ranking for heterogeneous information network analysis. 12th International Conference on Extending Database Technology: Advances in Database Technology, pages 565-576, 2009.
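To make the kind of heterogeneous network in Figure 1 concrete, the sketch below builds a small user-tweet-concept graph with networkx; the node names and edges are invented toy data, and only the graph construction (the input a RankClus-style method would consume) is shown, not the clustering itself.

import networkx as nx

G = nx.Graph()
G.add_nodes_from(["U1", "U2", "U4"], type="user")
G.add_nodes_from(["T1", "T2", "T3"], type="tweet")
G.add_nodes_from(["C:Football", "C:Election"], type="concept")

# Authorship edges (user posted tweet) and mention edges (tweet refers to concept).
G.add_edges_from([("U1", "T1"), ("U2", "T2"), ("U4", "T1"), ("U4", "T3")])
G.add_edges_from([("T1", "C:Football"), ("T2", "C:Election"), ("T3", "C:Election")])

# A RankClus-style algorithm would alternately cluster one node type (e.g. tweets)
# and rank the linked users and concepts within each cluster.
tweet_nodes = [n for n, d in G.nodes(data=True) if d["type"] == "tweet"]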


Thematic Event Processing for Large Scale Internet of Things Souleiman Hasan Supervisor: Dr. Edward Curry Insight Centre for Data Analytics at the National University of Ireland, Galway [email protected]

Abstract
The Internet of Things (IoT) will connect billions of devices to the Internet and form a large-scale, heterogeneous, and dynamic environment of data producers and consumers. Event processing systems can play a key role in enabling information exchange in such an environment, so that consumers can make sense out of the data. Event processing is scalable through a decoupled model of interaction in space, time, and synchronization. However, producers and consumers are coupled by event semantics. We propose a thematic event processing approach to loosen semantic coupling and enable large-scale Internet of Things applications.

1. Motivation
Smart cities are large-scale environments of IoT sensors and city data consumers for traffic control, logistics, energy management, etc. Many situations of interest in these domains require information from multiple distributed sources. Different sensors might use the terms 'parking space occupied' and 'garage spot taken' to refer to the same event.

2. Problem Statement
It is challenging for data consumers to know and have control over the environment at large scales, and thus many applications become limited in reality. Such systems do not scale, as the underlying assumption is semantic coupling between the event consumers and producers.

3. Related Work
Event processing systems have played a key role as middleware for large-scale environments. Nonetheless, content-based and concept-based event systems use semantically coupled models of interaction. They depend on implicit agreements on event types, attributes, and values, or on agreements on thesauri or ontologies. Such semantic models require granular and time-consuming agreements on concepts, which limits scalability for environments such as the IoT.

4. Research Question
How to efficiently and effectively enable event exchange at large scales and, at the same time, have loose semantic coupling between producers and consumers of events?

5. Hypothesis
The exchange of approximations of meanings, rather than terms or words, along with a distributional model of semantics, can achieve high effectiveness and efficiency in large-scale event processing systems.

6. Proposed Solution
We propose an extension to the current event processing paradigm that is based on (1) a semantic model based on terms' statistical co-occurrence in large textual corpora such as Wikipedia, (2) thematic tagging of events and subscriptions, and (3) an approximate probabilistic matcher of events, as shown in Figure 1 [1], [2], [3].

Figure 1: Proposed thematic event processing.

7. Evaluation
Experiments with events synthesized from real-world deployments of smart city sensors showed that the proposed approach can achieve up to 85% of matching quality of events and thousands of events per second of throughput, while at the same time following a loosely coupled model of interaction [1], as illustrated in Figure 2.

Figure 2: Results.

8. References
[1] S. Hasan and E. Curry, "Approximate Semantic Matching of Events for the Internet of Things," ACM Trans. Internet Technol., 2014. (In Press)
[2] S. Hasan, S. O'Riain, and E. Curry, "Approximate Semantic Matching of Heterogeneous Events," in 6th ACM International Conference on Distributed Event-Based Systems (DEBS 2012), 2012, pp. 252-263.
[3] S. Hasan, K. Gunaratna, Y. Qin, and E. Curry, "Demo: Approximate Semantic Matching in the COLLIDER Event Processing Engine," in 7th ACM International Conference on Distributed Event-Based Systems (DEBS 2013), 2013.
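To illustrate the distributional idea behind point (1) of the proposed solution, the sketch below scores 'parking' against 'garage' by the cosine similarity of their co-occurrence contexts; the two-sentence corpus and the threshold-based matching comment are editorial stand-ins for the Wikipedia-scale model the abstract describes.

import math
from collections import Counter

def cooccurrence_vector(term, corpus, window=5):
    # Distributional representation of a term: counts of words appearing near it.
    vec = Counter()
    for doc in corpus:
        words = doc.lower().split()
        for i, w in enumerate(words):
            if w == term:
                for neighbour in words[max(0, i - window):i + window + 1]:
                    if neighbour != term:
                        vec[neighbour] += 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

corpus = ["the parking space near the garage was occupied",
          "a garage spot was taken by another car"]
# An approximate matcher could fire a subscription tagged 'parking' on an event
# tagged 'garage' whenever this similarity exceeds a chosen threshold.
score = cosine(cooccurrence_vector("parking", corpus), cooccurrence_vector("garage", corpus))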


Tracking and Recommending News Doychin Doychev [email protected]

Abstract
In this work we are interested in tracking the evolution of news on the web, particularly real-time and social news feeds. Ultimately we are interested in developing new types of recommendation technologies that are capable of recommending the right news content to the right user at the right time.

1. Problem Statement
Recommendation techniques have been widely used in the news domain, employing a variety of information: content, user (collaborative), linked data, demographic and social networks. Although most of these techniques have been evaluated offline in laboratory scenarios, where efficiency and the dynamic nature of news consumption data are not primary concerns, there has been a push recently towards evaluating recommenders in situ, i.e. online, under live conditions, for example [2]. The CLEF-NewsREEL 2014 news challenge provides an ideal opportunity to test a number of different types of recommender algorithms over a spectrum of news providers, spanning from domain-specific news such as sports (sport1.de) through more general news (tagesspiegel.de) to online home and gardening stores (wohnen-und-garten.de). This work is one step towards our bigger research goal of tracking emerging and evolving news stories and providing the right users with the right access at the right time.

2. Research Question
In this research we have two high-level questions: how do we track news stories and how do we recommend news stories. One of the unique properties of the news domain is the fact that individual news articles can represent the evolution of an underlying news topic or story, and news topics can impact on each other, hinting that we should detect and exploit these relationships between stories. Moreover, news stories are not persistent like movies or books; they usually have short life cycles and varying relevance or popularity. Answering the question of how we can track these properties and make sense of them will give us a better understanding of the changing news cycle as it exists in today's mobile, always-on world. Usually, recommender systems recommend news articles as if they were independent entities, and don't take into consideration how articles relate to each other to form stories, or even how stories themselves may be connected. The key question we ask in this research is how we can exploit this deeper knowledge about the underlying relationships between articles and stories to provide users with more meaningful and useful news recommendations.

3. Hypothesis
We believe that social media channels can help us establish better user preference profiles by looking into what users and their friends (in the case of Facebook) like and share, and what they tweet about, who they follow and who their followers are (in the case of Twitter).

4. Proposed Solution
In order to build a better, personalized news recommender, such as that described by Phelan et al. [3], we need to take into account the user social profiles from Facebook, Twitter, LinkedIn, etc., to identify their professional and leisure interests, and how they change over time. This profile should be enriched with the user's browsing behaviour and used as a filter to deliver only those news stories to the user that are relevant at the moment of browsing. Additionally, users should be able to track the evolution of the stories that they are interested in, and even of individual entities (people, companies, products, events) that occur in them. In the context of achieving our proposed solution we participated in the CLEF-NewsREEL 2014 news challenge to evaluate the efficacy of 16 different recommendation strategies [1] in a live user setting. We have also begun to look into tracking news stories by their entities (people, companies, places, etc.) to better identify relationships between news stories. The benefits of this research can improve people's lives by providing the right users with the right access at the right time.

References
[1] D. Doychev, A. Lawlor, R. Rafter, and B. Smyth. An analysis of recommender algorithms for online news. In Proceedings of Conference and Labs of the Evaluation Forum, to appear in 2014.
[2] B. Kille, F. Hopfgartner, T. Brodt, and T. Heintz. The plista dataset. In Proceedings of the 2013 International News Recommender Systems Workshop and Challenge, pages 16-23. ACM, 2013.
[3] O. Phelan, K. McCarthy, and B. Smyth. Buzzer: Online real-time topical news article and source recommender. In Proceedings of the 21st National Conference on Artificial Intelligence and Cognitive Science, pages 251-261. Springer, 2010.


Tracking Breaking News on Twitter with Neural Network Language Models Igor Brigadir Insight Centre for Data Analytics, UCD [email protected]

Abstract
Twitter is often the most up-to-date source for finding and tracking breaking news stories. There is considerable interest among journalists in developing filters for tweet streams in order to track evolving stories. This is a non-trivial text analytics task, as standard retrieval approaches fail when presented with short documents and rapidly evolving topics. We propose an adaptive text similarity mechanism for tracking breaking news stories. Evaluations based on the ROUGE metric indicate that this approach performs well in tracking evolving stories on Twitter.

1. Motivation
Liveblogs and timelines are popular with many news outlets for broadcasting breaking news stories. Manually curating these requires a considerable amount of effort and attention from journalists. Current search methods on Twitter involve exact keyword matching and do not adapt to changes in language use. This requires users to adapt their queries manually as the story evolves over time. Our approach offers editorial support for the task, allowing journalists in smaller news teams with limited budgets to work more effectively when creating timelines.

2. Problem Statement
The problem of generating timelines for evolving breaking news events comprises two tasks: real-time ad-hoc retrieval (for each query, retrieve all subsequent relevant tweets from a stream) and timeline summarization (given retrieved tweets, remove redundant or duplicate information while maintaining good coverage). Our main focus is on retrieval.

3. Related Work
An approach in [5] deals with longer news articles, using a Time-Dependent Hierarchical Dirichlet Model (HDM) to generate timelines from HDM topics for sentence selection. Generating timelines using tweets was explored by Li & Cardie [2], but focused on generating timelines of events that are of personal interest, rather than breaking news events.

4. Research Question
In breaking news, the vocabulary used to describe events can evolve as new facts emerge. How can we capture and represent these changes for better retrieval accuracy? The distributional hypothesis in linguistics says that words used in the same contexts tend to purport similar meanings. Can changes in these contexts be used to enable more effective tracking of an event over time? The hypothesis is that a distributed semantic representation of a query q and a sequence of documents d1...n will perform substantially better than a traditional bag-of-words representation in a retrieval setting.

5. Proposed Solution
word2vec [4] is a neural network language model (NNLM). This language model builds distributed semantic representations of words, represented as dense vectors of real values. Representations of phrases and whole tweets can be created via an element-wise sum of these word vectors. The model is normally trained on large, static data sets; we show that retraining the model using fresh data in a sliding window approach allows us to create an adaptive way of measuring tweet similarity, by generating new representations of terms in tweets and queries at each time window.

6. Evaluation
Our generated timelines are evaluated using ROUGE [3], measuring overlap of terms with gold-standard, human-curated timelines. Example output is made available at http://mlg.ucd.ie/timelines, with more examples shown in [1]. Our approach performs better than, or as well as, an average human in terms of ROUGE scores.

Table 1: Scores for 3 sample events: MH370 disappearance, actor Paul Walker's death and the Apple Worldwide Developers Conference. Max is the best possible timeline given our data, Adap. and Static are the adaptive and static NNLM approaches, tf is a simple term frequency representation baseline.

           ROUGE-1 Recall                 ROUGE-1 Precision
Event:     Max   Adap.  Static  tf        Max   Adap.  Static  tf
MH370      0.44  0.35   0.32    0.09      0.83  0.24   0.28    0.11
Walker     0.55  0.30   0.05    0.09      0.69  0.20   0.14    0.07
WWDC       0.35  0.13   0.06    0.01      0.53  0.50   0.49    0.13

References
[1] I. Brigadir, D. Greene, and P. Cunningham. Adaptive representations for tracking breaking news on twitter. In NewsKDD Workshop Proceedings, 2014.
[2] J. Li and C. Cardie. Timeline generation: Tracking individuals on twitter. In WWW Conference Proceedings, 2014.
[3] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL-04 Workshop, 2004.
[4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS Proceedings, 2013.
[5] T. Wang. Time-dependent Hierarchical Dirichlet Model for Timeline Generation. arXiv preprint arXiv:1312.2244, 2013.
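A minimal sketch of the sliding-window retraining idea in Section 5, assuming the gensim 4 Word2Vec API; it retrains a small model on the current window of tweets, forms tweet vectors by element-wise sum, and ranks the window against the query by cosine similarity. Window size, vector size and other parameters are illustrative, not the paper's settings.

from gensim.models import Word2Vec
import numpy as np

def tweet_vector(model, tweet):
    # Tweet representation as the element-wise sum of its word vectors.
    words = [w for w in tweet.lower().split() if w in model.wv]
    return np.sum([model.wv[w] for w in words], axis=0) if words else None

def rank_window(query, window_tweets, vector_size=100):
    # Retrain on the current time window, then score each tweet against the query.
    model = Word2Vec([t.lower().split() for t in window_tweets],
                     vector_size=vector_size, window=5, min_count=1)
    q = tweet_vector(model, query)
    scored = []
    for t in window_tweets:
        v = tweet_vector(model, t)
        if q is not None and v is not None:
            sim = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            scored.append((sim, t))
    return sorted(scored, reverse=True)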


Ultimate Search: Visual Object Retrieval Zhenxing Zhang, Cathal Gurrin, Alan F. Smeaton INSIGHT Centre for Data Analytics, Dublin City University, Glasnevin, Dublin 9, Ireland {zzhang, cgurrin, asmeaton}@computing.dcu.ie

Abstract
Given an example image of a real-world object, visual object retrieval aims to retrieve, from a large data collection, all those images which contain occurrences of the same object. This research will not only help us to quickly obtain relevant information from massive multimedia data collections, but also benefit us in multimedia archive organization, visual object localization and identification.

1. Introduction
With the rapid growth of multimedia data collections, a wide variety of applications are needed to help users organize, browse and search the content. Currently, most multimedia applications are built to index and search based on "metadata" such as user-assigned tags or text descriptions. The aim of this research is to develop a multimedia search engine based on visual content information and help users to find the relevant documents in real time. We focus our research on searching for particular object instances such as landmarks, logos, specific people, or book covers. The research has wide usage in multimedia applications, for example human face identification, mobile visual guides for tourists, etc.

2. Challenges, Related Work, Proposed Solution
There are two main challenges in this research problem. Firstly, due to variation in imaging conditions, such as lighting changes, different scenes, and/or varying shooting angles, the visual appearance of the same object will most likely change between images. Secondly, performing real-time retrieval in a dataset of over millions of documents is a challenging task which requires efficient algorithms for indexing and memory management. Inspired by text retrieval knowledge, [2] proposed a visual object retrieval system based on representing visual objects within a bag-of-visual-words framework and adopted the inverted indexing algorithm [1] to support fast search. However, the global weighting scheme, TF-IDF, turns out not to be optimal in a visual search scenario. Thus we argue that a query-adaptive weighting scheme based on discriminative machine learning can better measure the similarity between visual objects, and hence improve the performance of the retrieval system.

3. Evaluation Framework and Datasets
We chose an open and well known evaluation campaign event, TRECVid [3], to provide us with the main evaluation dataset. It includes a reasonable number of real-world images and videos, enough well-defined search tasks with ground truth data, and standard scoring procedures for comparing results. Well-accepted measurements such as precision, recall and mean average precision will be calculated to evaluate the performance.

4. Experiments
For the purpose of evaluating retrieval system performance, we built a visual object search engine following three steps. Firstly, in order to overcome the constant changes in visual appearance, a robust and advanced image feature descriptor has been adopted and quantized into visual words according to their Euclidean distance in vector space. Then an inverted indexing structure is generated from those visual words to support real-time retrieval. Finally, the same features are extracted once query images arrive, and a ranked list is generated according to the relevance to the query visual object.

Figure 1: mAP between TRECVid12 and TRECVid13 (mean average precision for the runs iAD-12-1, iAD-13-1, iAD-13-2, iAD-13-3 and iAD-13-4).

Until today, we have participated two times in the annual TRECVid event, and our performance improved significantly in both precision and recall, comparing the second participation with the first one (see Figure 1).

References
[1] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[2] J. Sivic and A. Zisserman. Video Google: Efficient visual search of videos. In J. Ponce, M. Hebert, C. Schmid, and A. Zisserman, editors, Toward Category-Level Object Recognition, volume 4170 of LNCS, pages 127-144. Springer, 2006.
[3] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In MIR '06: Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pages 321-330, New York, NY, USA, 2006. ACM Press.
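The inverted-index step in Section 4 can be pictured with the toy structure below; the class, the histogram-intersection scoring and the assumption that images arrive as lists of pre-quantised visual word ids are editorial simplifications, not the authors' engine (which argues for a learned, query-adaptive weighting instead).

from collections import defaultdict, Counter

class VisualWordIndex:
    """Toy bag-of-visual-words inverted index over pre-quantised descriptors."""

    def __init__(self):
        self.postings = defaultdict(set)   # visual word id -> images containing it
        self.histograms = {}               # image id -> visual word frequency histogram

    def add(self, image_id, visual_words):
        self.histograms[image_id] = Counter(visual_words)
        for w in set(visual_words):
            self.postings[w].add(image_id)

    def query(self, visual_words):
        # Only images sharing at least one visual word with the query are scored.
        q = Counter(visual_words)
        candidates = set().union(*(self.postings[w] for w in q if w in self.postings)) if q else set()
        scores = {img: sum(min(q[w], h[w]) for w in q)
                  for img, h in self.histograms.items() if img in candidates}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

index = VisualWordIndex()
index.add("img_001", [3, 3, 17, 42])
index.add("img_002", [5, 8, 8, 99])
ranked = index.query([3, 42, 42])          # img_001 ranks first in this invented example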


Use of Inertial Sensors and Depth Cameras to Classify Performance Using Functional Screening Tools Darragh Whelan; Martin O’Reilly; Eamonn Delahunt; Brian Caulfield Insight Centre for Data Analytics, University College Dublin [email protected]

Abstract
Athletic screening is an important component of sports performance and injury prevention. However, screening can prove time-consuming and the data collected is often unreliable. This leads to inefficient time management and poor decision-making. Recent developments in Inertial Measurement Units (IMUs) and Red Green Blue Depth (RGBD) cameras, such as the Microsoft Kinect, may allow for automated screening tools that can produce valid and reliable data. This innovative work aims to use these low-cost sensor technologies in order to analyse human movement and develop such screening tools.

1. Motivation
Preseason athletic screening has become commonplace. It is used in order to identify potential risk factors that may lead to injury during the season, to allow individualised conditioning programmes to be developed, and to act as a baseline against which to compare the athlete in future. However, screening can prove onerous for athlete and clinician, and the data collected is often subjective in nature. Recent developments in IMU and RGBD camera technologies have the potential to address these issues and enhance the implementation of screening.

2. Related Work
Studies have shown that screening tools such as the squat, functional movement screen (FMS) and tuck jump may identify those at risk of future injury [1]. To date, classification of performance using these tools has been subjective, meaning those completing the screening must be skilled in biomechanical analysis. There have been a number of studies that have analysed the use of inertial sensors and RGBD cameras in quantifying exercise training. This includes work on exercise recognition, biofeedback and rehabilitation systems [2]. By building on this work, technologies may be developed that prove useful in objectifying and interpreting data collected from screening tools.

3. Problem Statement
To date, screening athletes has proven time-consuming and has produced data that is difficult to quantify objectively. The consequences of this are poor time management and inadequate interpretation of data.

4. Hypothesis
It is expected that, by using sensor fusion techniques, inertial and RGBD sensor technologies will provide a comprehensive biomechanical analysis of human movement. These can then be adapted to aid the collection of data from screening tools. This contrasts with similar work being completed within the personal sensing group in Insight, which will apply machine learning techniques to extract data from a minimal set of body-worn sensors to look at gym exercises.

5. Proposed Solution
A suitability study involving 22 weight-trained participants was conducted to ascertain whether IMUs and RGBD cameras could be used to classify simple biomechanical movements. The IMUs were placed on the lumbar spine, bilateral thighs and bilateral shanks. Angular data were then derived from the accelerometer and gyroscope signals from the IMUs. Initial analysis of the squat movement compared acceptable technique with induced deviations, such as bent over and heels elevated, using paired t-tests. Significant differences in the sensor output were found in a number of different signals. This indicates that there is potential to identify and quantify deviations within the squat using IMUs.

6. Future Work
Further features from IMUs and RGBD cameras will be analysed to allow for the recognition and evaluation of squat technique. Other commonly used screening tools such as lunges, deadlifts and tuck jumps will also be investigated using similar methods. Sensor fusion techniques, such as Kalman filters, will be used to maximise the accuracy of the movement analysis. This may be done by performing online calibration of sensor errors automatically whenever measurements from RGBD cameras are available [3]. It is hoped that this research will result in the development of automated screening tools to detect potential deficiencies in form and enable low-cost biomechanical analysis. This can reduce the risk of injuries and improve sporting performance.

7. References
[1] K. Kiesel, P. J. Plisky, M. L. Voight, "Can Serious Injury in Professional Football be Predicted by a Preseason Functional Movement Screen?," N Am J Sports Phys Ther., vol. 2, pp. 147-158.
[2] O. M. Giggins, U. M. Persson, and B. Caulfield, "Biofeedback in rehabilitation," J. Neuroeng. Rehabil., vol. 10, pp. 60, 2013.
[3] A. Bo, M. Hayashibe and P. Poignet. Joint angle estimation in rehabilitation with inertial sensors and its integration with Kinect. EMBC 2011: Annual International Conference of the IEEE Engineering in Medicine and Biology Society.
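The paired-t-test comparison in Section 5 amounts to the following kind of check; the angle values below are invented placeholders (the study's actual IMU signals are not reproduced here), and SciPy's ttest_rel is the standard routine for paired samples.

import numpy as np
from scipy import stats

# Hypothetical example: a derived angular feature (degrees) per repetition for one
# participant, acceptable squats vs. the same movement with heels elevated.
acceptable = np.array([92.1, 90.4, 93.0, 91.2, 89.8])
heels_elevated = np.array([84.3, 85.0, 86.1, 83.7, 85.5])

t_stat, p_value = stats.ttest_rel(acceptable, heels_elevated)
if p_value < 0.05:
    print("sensor signal differs significantly between conditions")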


Using Third Level Educational Data to Help “At Risk” Students Owen Corrigan Insight @ DCU [email protected]

Abstract
This project is about using Moodle data to enable lecturers to identify students "at risk" of failing a module and to enable them to intervene early. It also investigates the possibility of directly incentivising the students through "gamification".

1. Motivation
Throughout their time in college, students generate data by using infrastructure such as Content Management Systems (CMS) like Moodle and Blackboard, by using college WiFi networks, and through the results of exams and library access records. This data could potentially be used to help lecturers and administrators improve the students' experience in college. As a motivating example, consider a lecturer who teaches a first-year engineering course with 200+ students. This lecturer would like to reduce drop-out or failure rates on the course but does not have the resources to monitor individual students. It is important to know which students are at risk of dropping out as early as possible.

2. Problem Statement
Our challenge is to give lecturers and administrators tools to make effective interventions with individual students using the available data. In order for these tools to work, they will need to convey clearly which students are most at risk of failing. The lecturer should also have some understanding of how the tools create their predictions in order to have confidence in the system. The tools should be flexible enough to handle different types of interventions.

3. Related Work
In the last 15 years there has been a large amount of activity in the field of Educational Data Mining and the more recent field of Learning Analytics [1]. One criticism of the field is that projects are often one-off analyses or projects that are abandoned after a year. Two examples of successful implementations of learning analytics systems are Purdue Signals [2] and a project done by Marist College, USA [3].

4. Research Questions
The central question to be answered is "can we use the data collected on students to make accurate predictions about whether they will fail a given module?". Given this, how can we convey this information to lecturers effectively to allow them to make interventions? A related question is "can we use these predictions to alert the students directly, so that they can change their patterns before they fail?". Finally, it will be important to investigate the ethical implications involved in using students' data.

5. Hypothesis
Our hypothesis is that failure rates can be lowered by using these predictive models to target the most "at risk" students.

6. Proposed Solution
We have been working with the DCU Information Systems and Services department to get access to anonymised student Moodle log data and exam results. First we built a model of which students are most likely to fail. To do this we extract features from all available data and use machine learning techniques. We train one SVM for every week of student data for each module. We obtain a confidence score for classifying each student in order to prioritize student interventions. We then export this information to a web application that shows visualizations of the information to lecturers. The aim is to allow lecturers to be notified when a student becomes "at risk", and to provide an easy way for them to intervene. One aspect that we are working on with psychologists is to "gamify" this for students by generating a score for each student that represents behaviours linked with passing the module. They then compete with other students to maximize this score.

7. Evaluation
The current state of the project is that our classifier has performed reasonably well for the task, although the performance varies depending on the module. Using the Area Under Curve (ROC AUC) metric, we can achieve a result between 0.6 and 0.7 midway through the semester for some modules. We have created a "gamification" web application which we are planning to run on trial next semester for up to 10 modules (1,500 students).

8. References
[1] Ryan S. J. d. Baker, Kalina Yacef, "The State of Educational Data Mining in 2009: A Review and Future Visions", JEDM - Journal of Educational Data Mining, Vol 1 No 1, pp. 3-17.
[2] Matthew D. Pistilli, Kimberly E. Arnold, "In practice: Purdue Signals: Mining real-time academic data to enhance student success", About Campus, Wiley, pp. 22-24.
[3] Sandeep M. Jayaprakash, Erik W. Moody, Eitel J. M. Lauría, James R. Regan, and Joshua D. Baron, "Early Alert of Academically At-Risk Students: An Open Source Analytics Initiative", Journal of Learning Analytics, Vol 1 No 1, pp. 6-47.
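A compact sketch of the per-week SVM idea in Section 6, using scikit-learn; the feature layout, function names and the use of predicted failure probability as the "confidence score" are editorial assumptions about the shape of the approach, not the project's code.

import numpy as np
from sklearn.svm import SVC

def train_weekly_models(features_by_week, passed):
    # features_by_week[w]: (n_students, n_features) array of activity counts
    # (e.g. Moodle logins, resource views) accumulated up to week w for one module;
    # passed: 0/1 outcome per student (1 = passed the module).
    models = {}
    for week, X in features_by_week.items():
        models[week] = SVC(probability=True).fit(X, passed)
    return models

def at_risk_ranking(model, X):
    # Probability of class 0 (fail), used to prioritise lecturer interventions.
    p_fail = model.predict_proba(X)[:, 0]
    return np.argsort(-p_fail)        # most at-risk students first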


Variable Selection for Categorical Data Clustering Michael Fop, Thomas Brendan Murphy University College Dublin [email protected]

Abstract z z A variable selection method for latent class cluster analy- M1 M2 sis is proposed, in which the usefulness of a variable is as- sessed comparing two models. The method is capable of XR XC discarding not only those variables that do not contain any XC XP XC ⊆ XP information about the clusters, but also those that are su- perfluous, leading to a parsimonious selection. An appli-

cation to medical data related to musculoskeletal pain is presented.

1. Clustering categorical data
Latent class analysis (LCA) is a model-based approach for clustering multivariate categorical data, that is data arising from surveys, DNA sequence analysis, etc. Considering the data X observed on K categorical variables, the LCA model assumes that each observation comes from one of G classes (groups), identified by an unobserved latent indicator variable z of cluster membership. Then, within a class, the variables are assumed independent and each one is modelled through a multinomial distribution.

2. Variable selection in model-based clustering
As usual in a multivariate setting, it could be the case that not all the variables at hand are useful to detect the group structure in the data. This situation gives rise to the problem of selecting the subset of relevant variables and discarding those variables which are redundant or do not contain any information about the groups. Recently the problem of variable selection has been recast as a model comparison problem in which the competing models are defined according to the relationship between the candidate variables and the latent variable z. This approach has been shown to outperform the regularization approach to variable selection [1].

3. Variable selection method for LCA
Following the model comparison approach of [2], at each step of the procedure we split the data into:
• XC, the current set of variables relevant for clustering;
• XO, the set of variables not relevant for clustering.
Then a variable XP is proposed to be removed from or added to XC, and the decision is made by comparing two models: M1, in which the proposed variable contains useful information, and M2, in which it does not (Fig. 1). In [2], XP is assumed in M2 to be independent from XC. But if XP is related to some or all of the variables in XC, this assumption can wrongly lead to declaring XP relevant even if it actually does not contain further information about the clusters.

[Figure 1: The two competing models.]

To overcome the problem, in model M2 we let the proposed variable be related to a subset XR of the clustering ones [3]. Therefore XP would be discarded either because it does not contain information about the clusters or because it is superfluous. In this way we achieve a parsimonious selection, outlining a realistic and flexible framework. The models are then compared via an approximation to their posterior odds, p(M1 | X)/p(M2 | X), and a backward-stepwise greedy algorithm is used to conduct the search over the model space. The algorithm returns the optimal combination of number of groups and clustering variables.

4. Musculoskeletal pain data
The data consist of 36 clinical criteria recording the presence/absence of specific traits and symptoms in 425 patients of St. Vincent's Hospital in Dublin. Each patient suffers from a low back pain disorder and was assigned by a group of experienced physiotherapists to one of the categories of pain: Nociceptive, Peripheral Neuropathic and Central Sensitization. There is interest in finding the subset of clinical criteria that best discriminates between the three classes of pain, assigning the patients to them in advance of the experts' judgement. After applying the variable selection method we select 11 criteria, partitioning the data into 3 clusters. The clusters found agree with the expert-based classification of the patients with a rate of disagreement of 8.71%.

References
[1] G. Celeux et al. "Comparing Model Selection and Regularization Approaches to Variable Selection in Model-Based Clustering". Journal de la Société Française de Statistique, 155(2):57–71, 2014.
[2] N. Dean and A. E. Raftery. "Latent Class Analysis Variable Selection". Annals of the Institute of Statistical Mathematics, 62(1):11–35, 2010.
[3] C. Maugis et al. "Variable Selection in Model-Based Clustering: A General Variable Role Modeling". Computational Statistics and Data Analysis, 53(11):3872–3882, 2009.
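For the comparison step above, the quantity being approximated can be written out explicitly. The decomposition below is simply Bayes' theorem; the final BIC-style approximation is the kind commonly used in this line of work (e.g. in [2]) and is shown as an illustration rather than as the exact criterion of the present method:

    p(M1 | X) / p(M2 | X) = [ p(X | M1) / p(X | M2) ] × [ p(M1) / p(M2) ]

With equal prior model probabilities the decision reduces to the Bayes factor B12 = p(X | M1) / p(X | M2), which a Schwarz-type approximation replaces with

    2 log B12 ≈ 2 ( log L̂1 − log L̂2 ) − ( k1 − k2 ) log n,

where L̂m is the maximized likelihood of model Mm, km its number of free parameters and n the number of observations. The proposed variable is retained in the clustering set whenever the (approximated) odds favour M1.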


Deep Learning for High-Dimensional and Sparse Clinical Study Data J. O'Donoghue, M. Roantree School of Computing, Dublin City University E-mail: {jim.odonoghue, mark.roantree}@insight-centre.org∗

Abstract
Characteristics of many datasets often render data mining with traditional techniques intractable, or severely limit the accuracy of the models learned. In this research, we examine a complex method of data mining, called deep learning, to overcome the problems of high-dimensionality and sparsity, and determine if higher degrees of accuracy can be achieved.

1. Motivation
Traditional mining algorithms can be classed as shallow learners, as they consist of one or two layers of learning functions [1]. The properties of these algorithms are limited in dealing with high-dimensionality and sparsity, i.e. data with many features and a high number of missing values. The state of the art in machine learning is held by deep learning (DL) algorithms [4], which have three or more function layers. DL algorithms have successfully modelled high-dimensional and sparse data in the field of computer vision [2], but have not been widely applied outside this area. High-dimensional and sparse data was encountered in validating the In-MINDD (Innovative Mid-life INtervention for Dementia Deterrence) model on the Maastricht Ageing Study (MAAS). Found by clinical research colleagues in Maastricht University through a systematic literature review, the model predicts dementia risk in old age for those in mid-life, and its validation on an ageing-related dataset is necessary to corroborate the result of the literature review.

2. Problem Statement
Traditional methods to deal with high-dimensional data, such as hand-crafting features or feature selection, require either supplementary algorithms to the initial learner or much manual effort, which is an undesirable practice and could lead to a reduction in accuracy [1], [3]. Missing data in sparse datasets is often dealt with using simple algorithms to infer missing values, or by deleting the sparse record entirely, leading to the loss of important data.

3. Related Work
Humphrey et al. [3] showed that a DL network could learn the features most relevant to an outcome in a high-dimensional music dataset and that their model beat the preceding state of the art in this music information retrieval task. The deep algorithm used by Goodfellow et al. [2] successfully modelled a sparse dataset and even outperformed previous DL algorithms at classification, classification with missing inputs and standard imputation tasks.

4. Research Question
Can a DL algorithm be applied to high-dimensional and sparse clinical data in order to build a more accurate predictive model that outperforms its shallow counterparts?

5. Hypothesis
Deep learning algorithms can be successfully extended and applied to clinical datasets to overcome the problems of sparsity and high-dimensionality, modelling datasets that some current shallow architectures cannot, or, where shallow architectures can perform this modelling task, building more accurate models.

6. Proposed Solution
Our solution applies deep learning to longitudinal clinical trials which take biometrics on individuals at time-points over a period of years. Based on existing deep learning approaches [1], our solution is a software architecture whose basic components are the Data Transform, Learner and Layer classes. It allows for the application of different DL architectures from a single implementation to a dataset, agnostic of the data source.

7. Evaluation
Precision, recall and F1 metrics will be employed to test the accuracy of the model learned. Features learned through deep learning will be compared with those found through a Principal Components Analysis (PCA). For inferred features, we will enlist clinical research project partners in Maastricht to identify whether the features make medical sense in the context of a participant's other measures.

References
[1] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. 2013.
[2] I. Goodfellow, M. Mirza, A. Courville, and Y. Bengio. Multi-prediction deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 548–556, 2013.
[3] E. J. Humphrey, J. P. Bello, and Y. LeCun. Feature learning and deep architectures: new directions for music informatics. Journal of Intelligent Information Systems, 41(3):461–481, 2013.
[4] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.

∗Research funded by the European Union Seventh Framework Programme (FP7/2012) under grant agreement no. 304979.
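As a concrete illustration of the evaluation plan in Section 7, the sketch below (not the authors' implementation; the feature matrix and labels are placeholders) scores a set of predictions with precision, recall and F1, and extracts the PCA components against which the learned features would be compared.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.metrics import precision_score, recall_score, f1_score

    def evaluate(y_true, y_pred):
        # The three accuracy metrics named in Section 7.
        return {"precision": precision_score(y_true, y_pred),
                "recall": recall_score(y_true, y_pred),
                "f1": f1_score(y_true, y_pred)}

    X = np.random.rand(50, 20)                              # placeholder clinical feature matrix
    pca_features = PCA(n_components=5).fit(X).components_   # baseline features for comparison
    print(evaluate([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))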


Development of an Analytics Framework for User-Generated Multi-Lingual Data John Lonican, Brian Davis, Siegfried Handschuh, Conor Hayes National University of Ireland, Galway {john.lonican, brian.davis, siegfried.handschuh, conor.hayes}@insight-centre.org

Abstract
In this extended abstract, we present a high level overview of a social network analysis and content profiler for persons, locations and organizations, utilizing information extracted from user-generated social media content.

1. Introduction
With the abundant use of social media websites such as Facebook, Twitter and YouTube, there exists a large wealth of user-generated content. This content (data) can often be invaluable to companies worldwide. Wegener and Sinha of BAIN & Company, a management consulting firm, suggest that companies that are good at data analytics are more likely to be successful, stating that they are "Twice as likely to be in the top quartile of financial performance within their industries" and "Five times more likely to make decisions faster" [1]. Information such as that disseminated by social media users can provide valuable insights to a company, allowing it to adapt and improve its services based on user input and providing it with direction. This information is valuable and there is a need for tools to extract it and present it in a meaningful way. We have been collaborating with an international industry partner in the development of a prototype multi-lingual information extraction and visualization platform in response to this need.

2. Goal of Collaboration
To develop a framework for the extraction and visualization of information derived from unstructured, multi-lingual social media content.

3. Implementation Overview
The Information Extraction pipeline was developed using GATE [2], a framework for developing applications for natural language processing. As the raw data passed to the pipeline is multi-lingual, it is important to identify the language used in the data, as not all languages can be processed in the same manner. For our purposes, the Java Language Detection library1 was suitable for this task. Once the language of the data has been detected, standard preprocessing techniques are used. Gazetteers (lists of specific terms to annotate) are then applied to the data to annotate specific tokens of interest. These useful tokens are later used by JAPE2 (a finite-state transducer which performs regular expression-like operations over annotations) transducers to extract the targeted information using a rule based approach. Once extracted, the information is visually represented in the SociaLens web platform3.

[Figure 1: Extracted information as visualized in SociaLens.]

The data is treated differently depending on the language detected. The pipeline was designed to extract detailed information such as relationships and sentiments from specifically targeted languages such as English and German. Less targeted languages such as Chinese and Spanish also yield useful, but less detailed, information such as key terms and named entities. The pipeline is expandable to discover more detailed information and to cover more languages.

4. Use Cases
The extracted information displayed in SociaLens can serve to make sense of a lot of hidden information from social media content. The uses of this kind of information can be wide-reaching, allowing users to locate and realize potentially important strategic information such as sentiment trends, competitors, events and key people in a domain.

5. References
[1] R. Wegener, V. Sinha, "The value of Big Data: How analytics differentiates winners", http://www.bain.com/publications/articles/the-value-of-big-data.aspx, accessed 18 Aug 2014
[2] H. Cunningham, et al. Text Processing with GATE (Version 6). University of Sheffield Department of Computer Science. 15 April 2011. ISBN 0956599311

1 https://code.google.com/p/language-detection, accessed 18 Aug 2014
2 http://gate.ac.uk/sale/tao/splitch8.html, accessed 18 Aug 2014
3 http://socialens.deri.ie, accessed 18 Aug 2014
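The language-routing decision described in Section 3 can be sketched in a few lines. The pipeline itself performs this inside GATE with the Java language-detection library; purely for illustration, the snippet below uses langdetect (a Python port of that library), and the set of "detailed" languages is an assumption, not the partner deployment's configuration.

    from langdetect import detect   # pip install langdetect (Python port of the Java library)

    DETAILED = {"en", "de"}         # languages given full relation/sentiment extraction

    def route(text):
        lang = detect(text)
        return lang, ("full-extraction" if lang in DETAILED else "key-terms-and-entities")

    print(route("Die Firma eröffnet ein neues Büro in Berlin."))   # ('de', 'full-extraction')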


Diversity-aware Top-N Recommender Systems Andrea Barraza-Urbina, Conor Hayes Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland [email protected], [email protected]

Abstract
Recommendation Systems (RS) have emerged to guide users in the task of efficiently browsing/exploring a large product space, helping users to quickly identify interesting products. However, suggestions generated with traditional RS usually don't produce diverse results, even though it has been argued that diversity is a desirable feature. Our work aims to study current diversification techniques for RS and the impact of offering diverse results on user satisfaction.

1. Motivation
Recommendation Systems (RS) help users find useful products in less time. Traditionally, RS are assessed with accuracy metrics. But accuracy alone is not a clear indicator of RS quality [2]. Other characteristics such as novelty and diversity should also be evaluated. Specifically, considering diversity is a challenge for RS, which tend to offer users lists composed of similar items (e.g., the user loves Star Trek so receives suggestions of only Star Trek related items [2]). A diversity-aware RS aims to reduce the redundancy in recommendation lists by offering users a range of options, not a homogeneous set of alternatives. Adding diversity to RS can: (a) encourage product discovery by incentivizing users to explore unknown sections of the catalog, (b) cover a wider spectrum of user preferences, and (c) respond to ambiguous user preferences with a list of varied items, thus increasing the chance the user will like at least one item. Diversity can be detrimental to RS accuracy but can have a positive impact on user satisfaction [2][5]. This project focuses on offering solutions to challenges surrounding diversity-aware RS.

2. Background and Related Work
A recommendation is a set of N items ordered to maximize the utility/value of the items for the user. Utility is represented by a score or rating assigned explicitly by the user or estimated by the RS. The recommendation problem is centered on the prediction of the score/utility for unrated items. To achieve this, traditional RS techniques are based on two heuristics: (a) Content-based: users like items similar to those they have liked in the past; (b) Collaborative Filtering: users like items that other users with similar preferences have liked in the past. In order to evaluate the quality of a RS, accuracy metrics are used to measure the ability of the RS to predict the rating for unrated items.
RS do not offer diverse recommendations naturally for the following reasons: (a) the heuristics that lay the foundation of RS techniques are based on similarity measures, (b) traditional evaluation metrics encourage accuracy but penalize diversity [2], and (c) recommendation list evaluation is performed as an aggregate of the individual scores of items, disregarding the real value of items in the context of the list [2].
The study of diversity-aware RS has become an important research challenge in recent years, drawing inspiration from diversification solutions for Information Retrieval such as clustering and re-ranking. Many approximations have been proposed both for diversification techniques and for the creation of metrics to measure diversity. Ziegler et al. [5] propose the Intra-List Similarity Metric and the Topic Diversification technique. Adomavicius et al. [1] propose and evaluate several re-ranking methods that consider factors such as item popularity to increase diversity but maintain accuracy. Zhang et al. [4] aim to partition the user profile into clusters of similar items and suggest to the user a list of items that match these clusters. Vargas et al. [3] define a formal framework for the definition of novelty and diversity metrics. However, despite all current advancements, diversity for RS remains an open field of research.

3. Research Goals
We plan to address the following research questions: Does adding diversity have a positive impact on user satisfaction? Which trade-off between accuracy and diversity do users prefer? What are the characteristics of diversity that can be parameterized?
In order to tackle these questions we plan to: (i) Identify the most relevant diversification techniques and metrics to measure diversity-aware RS quality. (ii) Identify the parameters of diversification techniques that can be adjusted/tuned in accordance with user and/or domain characteristics. (iii) Design a RS that implements the diversification techniques (with their parameters) to generate top-N recommendation lists. (iv) Validate the system with a functional prototype carrying out user evaluations.

4. References
[1] Adomavicius, G., Kwon, Y. "Toward more diverse recommendations: Item re-ranking methods for recommender systems." Proc. of WITS'09.
[2] McNee, S.M., Riedl, J., Konstan, J.A. "Being accurate is not enough: how accuracy metrics have hurt recommender systems." Proc. of CHI'06. ACM, NY.
[3] Vargas, S., Castells, P. "Rank and relevance in novelty and diversity metrics for recommender systems." Proc. of RecSys'11. ACM, NY, pp. 109-116.
[4] Zhang, M., Hurley, N. "Novel Item Recommendation by User Profile Partitioning." Proc. WI-IAT'09, pp. 508-515.
[5] Ziegler, C.N., McNee, S.M. et al. "Improving recommendation lists through topic diversification." Proc. of WWW'05. ACM, NY, pp. 22-32.
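As a generic illustration of the re-ranking family discussed in Section 2, the sketch below greedily trades predicted relevance against similarity to already-selected items. It is a standard baseline in the spirit of those methods, not a reproduction of the specific algorithms in [1] or [5]; the weight `lam` is exactly the kind of tunable diversification parameter referred to in research goal (ii).

    def rerank(candidates, relevance, similarity, n, lam=0.7):
        """Greedily pick n items, balancing relevance against redundancy.
        relevance: item -> predicted score; similarity: (i, j) -> value in [0, 1]."""
        selected, pool = [], set(candidates)
        while pool and len(selected) < n:
            def score(i):
                redundancy = max((similarity(i, j) for j in selected), default=0.0)
                return lam * relevance[i] - (1.0 - lam) * redundancy
            best = max(pool, key=score)
            selected.append(best)
            pool.remove(best)
        return selected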


Educational Data Analytics John Brennan Dublin City University [email protected]

Abstract
Moodle is a Virtual Learning Environment and an important learning tool for students in university. It allows students to access information about their course and the supporting lecture material. We examine how we can utilise Machine Learning on Moodle to derive important attributes about a student's performance. This can also be used to provide lecturers and other relevant parties with early warning flags for students who are performing poorly, allowing for intervention. Using this data also opens up a realm in which to 'gamify' education, giving students an incentive to improve performance.

1. Motivation
The first few weeks for a student who is new to third level education are regarded as the 'make or break' period. With the large scale of a third level educational setting, it is currently difficult to ensure a student is settling in. Students can only be officially identified as being at risk when they have failed a certain number of modules in their first semester. Moodle access logs can give an indication as to how a student is interacting with his/her course and college life. Being able to classify underperforming students and intervene early would result in decreased dropout rates.

2. Problem Statement
Provide a way to identify students having difficulty by extracting relevant data from records that classify students into pass/fail categories using Moodle logs. Present the data in a suitable and comprehensible format for relevant parties. Address the ethical questions that arise from obtaining confidential student data.

3. Related Work
A previous study has been undertaken to understand student behaviour by mining data from Moodle. That study on Moodle feature extraction has reinforced the importance of the logs [1]. In Marist College, USA, Educational Data Analytics has been successfully implemented [2].

4. Research Question
Given a set of access logs from Moodle, are there features that contribute to a student passing or failing? What can be deduced from the data using Machine Learning techniques? How can we present this data to lecturers and students in an informative and ethical way?

5. Hypothesis
There are certain features found in Moodle that give an indication of whether a student will pass or fail a module. Providing information about students' predicted results will improve pass rates and decrease dropout rates.

6. Proposed Solution
On a weekly basis, extract the data from Moodle and analyse it using a Support Vector Machine. Then compress the results to a confidence score and a prediction. This is delivered to lecturers and other relevant parties through an application programming interface for the purpose of monitoring student progress. Information will also be given to the students in the form of weekly emails containing visuals and written feedback regarding their performance. The system will be used as a pilot scheme for the 2014/2015 academic year in DCU.

7. Evaluation
By using cross-validation, we found that the classifier was performing to a high standard for historically high failure rate modules. Our results show that the number of accesses in a week, where and when Moodle is accessed, and how a student interacts with resources (assignments, lecture notes) are examples of key features in determining a student's pass/fail prediction. The accuracy of the model will be somewhat skewed due to the expected behaviour change caused by the gamification aspect of the pilot.

8. Future Work
If the pilot study is a success, other features beyond Moodle could prove useful. Library access logs, club and society memberships, gym access logs, and general Internet usage on campus could all be used to deduce whether a student is going to college and attending lectures. This could all strengthen the classifier.

9. References
[1] Kevin Casey, Paul Gibson, "(m)Oodles of Data: Mining Moodle to understand Student Behaviour", International Conference on Engaging Pedagogy 2010 (ICEP10)
[2] Sandeep M. Jayaprakash, Erik W. Moody, Eitel J.M. Lauría, James R. Regan, Joshua D. Baron, "Early Alert of Academically At-Risk Students: An Open Source Analytics Initiative", Journal of Learning Analytics
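A minimal sketch of the weekly classification step described in Section 6, assuming the Moodle logs have already been turned into per-student feature vectors; the feature choices and toy values below are illustrative only, not the pilot's actual features.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Each row is one student-week, e.g. [accesses, resource views, assignment views].
    X = np.array([[12, 5, 3], [0, 0, 0], [30, 14, 6], [2, 1, 0], [25, 9, 4], [1, 0, 1]])
    y = np.array([1, 0, 1, 0, 1, 0])     # 1 = on track, 0 = at risk

    clf = SVC(probability=True)          # probability=True yields the confidence score to report
    print(cross_val_score(clf, X, y, cv=3))   # cross-validation as in Section 7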


A Keyword Sense Disambiguation Based Approach for Noise Filtering in Twitter

Sanjaya Wijeratne∗ Bahareh R. Heravi Kno.e.sis Center, Wright State University – USA Insight Centre for Data Analytics – NUIG [email protected] [email protected]

Abstract
In this paper, we describe an approach to filter out noisy data generated by keywords-based tweet filtering methods by performing Word Sense Disambiguation on the keywords used to collect tweets. We present the noise filtering problem as a binary classification problem and discuss our evaluation strategy, which is to be carried out in future.

1. Motivation
With the growing popularity of streaming social media platforms such as Twitter for news reporting, locating timely and newsworthy information from them has become an essential step in Digital Journalism. Journalists use keywords-based tweet filtering to locate tweets created by eyewitnesses in order to create news stories. Keywords-based tweet filtering also brings in a lot of irrelevant tweets. For example, a journalist who uses the keyword 'shoot' to find information about shooting incidents around the world via Twitter would get irrelevant tweets about photo/video shoots and football goal shoots because of the ambiguity of the term 'shoot'. The motivation of this work is to help journalists find newsworthy content that interests them (tweets that are not noisy) from Twitter by filtering out noisy tweets collected by keywords-based tweet filtering.

2. Problem Statement
Let K be the set of all keywords used to collect T, the set of all tweets collected. Let S be the set of all senses (glosses) for all keywords in K. Let S+ ⊂ S be the set of all senses that could collect interesting tweets to the user, over all ki ∈ K. Let P ⊂ K be the set of all tweet collecting keywords present in t. Given a tweet t ∈ T, K and S+, how do we determine whether t is an interesting tweet to the user? In other words, can we determine that t is not a noisy tweet given K and S+? Hence this is a binary classification problem.

3. Related Work
Tweet classification in information filtering is a challenging problem because general text classification methods fail to address the problems of sparsity and the non-standardized language used in tweets [1]. People have used supervised, semi-supervised and semantic relationships-based classification approaches to address the problem of short text classification [1]. But, to the best of our knowledge, this is the first attempt at using word sense disambiguation on the set of tweet collecting keywords that are present in a tweet t to determine the intention of using them in the tweet's context Ct (Ct is defined in Section 6), which will then be used to classify t.

4. Research Question
We attempt to address how the different senses of keywords in P can be used to determine whether t is noisy or not.

5. Hypothesis
Assume we have collected a tweet t, and know K and S+; determining the senses of all keywords present in t, which is P ⊂ K, can be used to determine whether t was intended to be collected by the user using P or not.

6. Proposed Solution
For each tweet collecting keyword ki ∈ K, we extract its senses from BabelNet1 to generate S. Si is the set of all senses of ki ∈ K. For each ki ∈ K, the user will select the set of senses Si+ ⊂ Si that made the user pick ki as a keyword, which helps us to understand what senses of ki would bring interesting tweets to the user. All senses of a keyword ki that are not selected by the user, Si– ⊂ Si, are considered senses that could bring noise for ki. For each sense si ∈ Si of ki, we generate a list of associated words (stopwords removed and stemmed) using BabelNet synsets, glosses, entities and their types, which act as the context Csi for each sense si ∈ Si of ki. Given a tweet t, we identify entities and their types, and remove stopwords and stem the remaining words, to generate the context of the tweet, Ct. For each keyword pi ∈ P in t, we disambiguate and assign the best sense to pi using the Simplified LESK algorithm2 by calculating the overlap of each keyword sense's context Csi with Ct. If the sense assigned to keyword pi is from Si–, t will be classified as a noisy tweet for pi and will be filtered.

7. Discussion on Evaluation
We plan to evaluate our approach using randomly selected tweet samples for each keyword that we used to collect tweets. We will manually remove any duplicate tweets in them. Accuracy will be measured on how precisely our approach identifies noisy tweets. In our initial evaluation for the keyword 'shoot' with a sample set of 100 tweets (66 noisy), we achieved 89% accuracy in removing noisy tweets.

References
[1] G. Song, Y. Ye, X. Du, X. Huang, and S. Bie. Short text classification: A survey. Journal of Multimedia, 9(5):635–643, 2014.

1 http://tinyurl.com/BabelNet
2 http://tinyurl.com/SimpleLESK
∗This work is from the author's ongoing internship at the Insight Centre.
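The core of the Simplified Lesk step in Section 6 is a word-overlap count between the tweet's context Ct and each candidate sense's context Csi. A minimal sketch (contexts are assumed to be pre-built sets of stemmed, stopword-free words; building them from BabelNet is not shown, and the toy word sets are invented for illustration):

    def best_sense(tweet_context, sense_contexts):
        # sense_contexts: sense id -> set of associated words (the Csi of Section 6)
        return max(sense_contexts, key=lambda s: len(sense_contexts[s] & tweet_context))

    def is_noisy(tweet_context, sense_contexts, noisy_senses):
        # The tweet is filtered for a keyword if its best sense is one the user did not select.
        return best_sense(tweet_context, sense_contexts) in noisy_senses

    shoot_senses = {"fire_weapon": {"gun", "fire", "polic", "incid"},
                    "photo_shoot": {"photo", "camera", "model", "studio"}}
    print(is_noisy({"new", "photo", "studio"}, shoot_senses, noisy_senses={"photo_shoot"}))  # True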


Low Cost Motion Capture Integrated into Home Based Serious Games Andrew Daly, David S. Monaghan Insight Centre for Data Analytics [email protected], [email protected]

Abstract
The research presented in this abstract concerns the utilization of Serious Games for home-based patient rehabilitation. The main goal here is to attempt to provide motivation and enjoyment during the performance of exercises while using the low-cost Microsoft Kinect integrated into a customized open sourced game engine.

1. Motivation
Serious games can be thought of as any game based interfaces that have been designed for any purpose other than entertainment. Serious games have been researched in the areas of: military, health, government and education. The design of these serious games can offer valuable contributions to developing effective games in the area of rehabilitation. Like any other computer game, they are fundamentally intended to capture and keep a person's attention.

2. Problem Statement
Patients that require rehabilitation for balance re-training, rheumatoid arthritis rehabilitation, rehabilitation following stroke, etc. must perform consistent exercises as a crucial element in their overall physical and mental rehabilitation. However, patients at home tend to either only follow their programmes for a short period of time or not follow them at all. The problem we are aiming to solve is to get patients to follow their exercise programmes regularly and to view them as a fun activity.

3. Related Work
Lots of research has been conducted on using serious games for rehabilitation. Effective rehabilitation must be early, intensive and repetitive [1]. Serious games provide a means to maintain motivation for people undergoing therapy [2] by means of exercises. Virtual reality games and webcam-based imaging games [3] are usually the solution to provide an engaging and motivating tool for physical patient rehabilitation. Most of the systems presented in the literature are very specialized to a particular condition and most have expensive hardware requirements.

4. Research Question
Can the use of low cost sensors in serious games address the problem of patient adherence to exercise rehabilitation programs in home based scenarios?

5. Hypothesis
This research will provide a first step towards proof-of-concept that serious games are a crucial cog in a home rehabilitation system, by utilizing low cost home based motion capture devices such as the Kinect, which may already be present in households. The use of low-cost sensors in serious games can enable the use of those games for rehabilitation by creating personalized games and by making them more widely available than other, expensive alternatives.

6. Proposed Solution
[Figure 1: Screenshot from the 'power-up' element of the game.]
In this work we have developed an early stage game on the open sourced Unity3D game engine and integrated it with the Microsoft Kinect motion capture device. The objective of the game is to allow the patient to perform their rehabilitation exercises in a fun 'gamified' environment. This is achieved by powering up a turret and shooting targets. However, many things must be done to gain ammunition for the turret and power it up. The game sequence is as follows: once the game starts, there is a warm-up round, where the user must perform basic movements like capturing orbs. After that, exercises must be performed, which are automatically detected by the system, to power up the turret and acquire ammunition (Figure 1). Once the exercises are performed the turret can be used. After each round of firing the turret and acquiring ammunition for it, the difficulty of powering it up and firing at targets increases.

7. Evaluation
The research presented here reflects an initial three months' investigation into this project. In the future, usability testing is planned to stress test the technology in order to produce a fully working prototype.

8. References
[1] REGO, P., MOREIRA, P.M. and REIS, L.P., 2010. Serious games for rehabilitation: A survey and a classification towards a taxonomy, Information Systems and Technologies (CISTI), 2010 5th Iberian Conference on, IEEE, pp. 1-6.
[2] Burke, J.W., Serious Games for Upper Limb Rehabilitation Following Stroke, In Games and Virtual Worlds for Serious Applications, 2009. VS-GAMES '09., IEEE, pages 103-110, Coventry, 23-24 March 2009.
[3] Burke, J.W., Augmented Reality Games for Upper-Limb Stroke Rehabilitation, In Games and Virtual Worlds for Serious Applications (VS-GAMES), 2010 Second International Conference, IEEE, pages 75-78, Braga, 25-26 March 2010.


“Measuring Scientific Impacts of Research through Altmetrics” Mohan Timilsina, Vaclav Belak, Conor Hayes National University of Ireland, Galway {mohan.timilsina, vaclav.belak, conor.hayes}@insight-centre.org

Abstract
The impact of scientific research is only realized if it is communicated to the broader community. In the era of the social web, the social mention of scientific publications appears to be complementary to traditional bibliometric impact analysis. The impact of scholarly publications is not only limited to citations in the academic world but also extends to the social web. Such online metrics measuring the impact of scholarly publications have given rise to a new area of metrics called "altmetrics".

1. Motivation
Research policy makers, funding agencies, universities and governments take decisions in terms of traditional citation counting, peer review and journal impact factors of scholarly research. Any peer reviewed publication requires a long time to accumulate citations because it has to wait for other publications to cite it. Alternatively, the web provides a platform for the discussion and diffusion of scholarly information. In this case "altmetrics" can be very useful to analyze a publication's impact in social media like Twitter, Facebook, research blogs, mainstream news and public policy documents, providing an instantaneous and broader picture of research impact.

2. Problem Statement
Citation of new scholarly publications in formal academic discourse takes a long time, and as a result it is hard to gauge their immediate impact. This research is focused upon identifying such scholarly activity online, monitoring scholarly publications and promoting them to become highly cited papers in the future.

3. Related Work
Moed [1] has discussed citation analysis in research assessment to measure the impact or influence of scholarly work. Similarly, Thelwall [2] has pointed out the shift of traditional impact analysis through bibliometrics into webometrics. In this research we are trying to measure the qualitative impact of research through altmetrics.

4. Research Question
The goal of this research is to measure impact through altmetrics, and it is driven by the following research questions:
1. How are research impacts generated?
2. How can we measure such impacts in web documents across different domains?

5. Hypothesis
Our hypothesis concerns how to use social media as a new opportunity to leverage citation work more broadly and to generate impact through it. The figure below shows the indicators of research impact.

[Figure 1: Indicators of research impact — a diagram linking Research, Social Media and Public Policy.]

Research cannot exist in a vacuum, nor does its impact. The impact is generated in public policy and social media. The impact will not only be quantitative, as citation counts, but also qualitative, for example the extraction of stories or the impact of a person in a particular context.

6. Proposed Solution
To defend this idea we are running an experiment on Spinn3r data1 which we have collected for the time period of Nov 2010 to July 2011. We are particularly looking at Avian Influenza and Haiti Cholera related information in the mainstream news and blogs because they were highly debated topics during that time. What we want to see is the scholarly activity of researchers, by analyzing the network of hyperlinks from retrieved web documents. Currently we have also extracted the list of relevant English journals from the DBpedia source. We are going to match these journal listings with the web documents that we extracted. After we have the representative set of hyperlinks, we will run a network analysis using the in-degree count of each node as a citation measure, and a community analysis algorithm to identify influential scholarly activity in the network.

7. Conclusion
Research impacts are quantitatively analyzed using citation counts. Our research will focus on the qualitative aspect of research impact using state of the art analysis of mass media and public policy documents.

References
[1] H. F. Moed. Citation analysis in research evaluation, volume 9. Springer, 2006.
[2] M. Thelwall. Bibliometrics to webometrics. Journal of Information Science, pages 1–18, June 2008.

1 http://www.spinn3r.com/
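A small sketch of the network-analysis step planned in Section 6: build the hyperlink graph of retrieved web documents and use in-degree as the citation-style measure. The edges below are invented placeholders and the community-analysis step is omitted.

    import networkx as nx

    g = nx.DiGraph()
    g.add_edges_from([("blog_post_a", "journal_x"),
                      ("news_article_b", "journal_x"),
                      ("blog_post_a", "journal_y")])   # hyperlinks from web documents to journals

    indegree = dict(g.in_degree())                      # e.g. {"journal_x": 2, "journal_y": 1, ...}
    ranked = sorted(indegree, key=indegree.get, reverse=True)
    print(ranked[:10])                                  # most-linked documents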


Modeling Causality for Big Data Dalila Messedi, 2nd year student at Telecom SudParis, French Engineering School Insight Centre for Data Analytics, University College Cork [email protected]

Abstract
The large amount of data collection available to the modern community requires the development of more efficient tools to analyze such data. The aim of this study is to determine whether or not Markov graphical models for causality are relevant when it comes to mining Big Data.

1. Motivation
Relations of causality between two events are what we are looking for when trying to explain the occurrence of the second event as a consequence of the first one. This is particularly relevant in prediction and data mining. The tools commonly used to generate such predictions often rely on correlation analysis. However, correlation doesn't necessarily imply causality, hence the required development and/or analysis of proper causal models. An interesting tool for prediction would be the result of an intervention, represented by the "do-calculus" expression introduced by Pearl.

2. Problem Statement
The problem is exploring the relevance of causal models in analyzing selected variables from data in, for example, social networks or social science.

3. Related work
This study will be based on previous works that provided tools to graphically model causality. The main theory was built from Judea Pearl's [1], while the initial experiments are performed using the pcalg package [2], a tool developed for the R software. Other Bayesian algorithms to disclose causal relations exist, such as CD-H and CD-B [3], and Dr. Freedman, a detractor of this specific model, also wrote his own theories about the subject [4].

4. Modeling Causality
To establish and represent relations of causality between several variables, variations of the Inductive Causation (IC) algorithm are implemented. The expected graph models would be Directed Acyclic Graphs (DAGs), as the one reported in Figure 1.

[Figure 1: Graphical causal model obtained with the pcalg package.]

Figure 1 represents relations of causality between post-operation vitals and the releasing decision from the hospital (obtained with the pcalg package [2]). For instance, in this graph we can see that the variables BP (blood pressure) and corestb (stability of core temperature) have a direct causal effect on the decision, which is represented by the arrows directed towards "decision".

5. Hypotheses
Two main hypotheses are made in order to be able to draw conclusions from the graph, which are:
- the causal Markov property;
- the faithfulness assumption.
The feasibility of these assumptions is in fact subject to a fierce debate between those in favor of the model and those opposed to it.

6. Discussion
The goal of this study is to compare traditional statistical tools and causal graphical models in analyzing and extracting useful insights from data in many fields, from social networks and human mobility to social sciences. To evaluate the model, a comparison in terms of prediction accuracy will be conducted, in order to check whether the causal model is actually consistent or not.

7. Acknowledgement
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.

8. References
[1] Judea Pearl, Causality: models, reasoning and inference, Cambridge University Press, 2000.
[2] Markus Kalisch, Alain Hauser, Martin Mächler, Marloes H. Maathuis, Diego Colombo, Peter Bühlmann, "More Causal Inference with Graphical Models in R, Package pcalg", 2014.
[3] Subramani Mani, Constantin F. Aliferis, Alexander Statnikov, Bayesian Algorithms for Causal Data Mining, NIPS workshop on causality, 2008.
[4] David Freedman, From Association to Causation via Regression, Conference on Causality in Crisis, 1993.


Multi-Agent Decision Making Over Wireless Network Martin Bullman 2nd Year BSc Computer Science Student (Intern) University College Cork [email protected]

Abstract
We developed a testbed for realtime decision making over wireless networks using multiple Intel Galileo embedded microcontrollers.

1. Introduction
Real time decision making based on heterogeneous data is becoming increasingly important in a wide range of industry sectors. For many applications, this involves gathering data from widely distributed sensors, making appropriate decisions, and transmitting them back out to remote actuators. This process of centralisation is often too slow for realtime decision making. Most existing research in distributed reasoning ignores the issues of wireless sensing, communication and actuation. The aim of this project is to construct a wireless testbed that will allow researchers to develop and test new protocols and algorithms.

2. Background
The need to consider wireless networks in distributed reasoning has been identified [1]. Intel have released Galileos [2], which are microcontroller boards based on the Intel Quark SoC X1000 application processor. The Intel Galileo also has wireless capabilities and is Arduino compatible. This is a step beyond traditional embedded microcontrollers, which were standalone.

3. Problem Statement
Construct a testbed using multiple Galileos which would demonstrate the principle of sensing, communication, actuation and coordination using wireless networks (Fig. 1), and which would allow researchers to implement and test new algorithms on this wireless platform.

[Figure 1: Intel Galileos connected to sound sensors used to demonstrate the application.]

4. Solution
The proposed solution is to implement a network of devices which sense sounds and communicate with each other to determine which device is closest to the sound. This could then be extended to localize and track sound sources. Intel provided us with 5 Galileo units to use in our research. We extended the Galileos by adding microphones to each unit. For wireless communication we attached an Intel Centrino Wireless-N35 Wi-Fi card to each board and configured the Galileos' cards to generate an ad-hoc Wi-Fi network, which enabled wireless communication between all devices. To indicate the conclusion of the decision process we also added LEDs to each board. For the coordination protocol, any device that hears a sound above a threshold can initiate the process by broadcasting a message to all devices (usually, we assume a fully connected network). To visually indicate that a device has detected a sound above the threshold, its green LED will flash; all devices will then begin a synchronous communication sequence in which each device broadcasts its sound level. The node with the lowest ID broadcasts its sound level first, and each node then transmits in ID sequence until all nodes have broadcast their level. After the synchronous communication sequence is complete, each node can then decide who was closest to the sound. We then actuate and show the result by turning on the green LED of the node that had the highest sound level.

5. Evaluation
We successfully assembled an Intel Galileo wireless testbed on which researchers can now test and implement new protocols and algorithms. We also constructed a demonstration application on top of our wireless testbed to demonstrate issues in multi-agent decision making over wireless networks. When we have a small number of devices (three or four), our protocol for sensing, communicating and actuation works reliably, but when we increase the number of devices and the speed at which they are communicating, packet loss increases dramatically.

6. References
[1] Mohamed Wahbi and Kenneth N. Brown, "The impact of wireless communication on distributed constraint satisfaction", CP 2014, 20th Intl Conf on Principles and Practice of Constraint Programming, Lyon, September 2014.
[2] Intel Galileo embedded Microcontroller Data Sheet: https://communities.intel.com/docs/DOC-21835
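The decision rule at the end of the synchronous round can be stated in a few lines: once every node has received every other node's broadcast sound level, each node independently picks the loudest reporter. The tie-breaking rule below (lowest ID wins) is an assumption for illustration; the abstract does not specify one.

    def decide(levels, my_id):
        """levels: node id -> reported sound level, one entry per node in the network."""
        loudest = max(levels, key=lambda nid: (levels[nid], -nid))  # ties go to the lowest id
        return loudest == my_id   # True on exactly one node: light its green LED

    readings = {1: 0.42, 2: 0.87, 3: 0.11}
    print([nid for nid in readings if decide(readings, nid)])  # [2]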


Multimodal Human Motion Capture and Synthesis Marc Gowing1, David S. Monaghan1, Noel E. O’Connor1 Insight Centre for Data Analytics [email protected], [email protected], [email protected]

Abstract
Human motion capture (MoCap) involves the sensing, recording and mapping of human motion to a digital model. It is useful for many commercial applications and fields of study, including digital animation, virtual reality/gaming, biomechanical/clinical studies and sports performance analysis. What follows is an overview of the current research being carried out in Insight.

1. Motivation
Highly accurate motion capture systems are typically expensive and restricted to an indoor studio environment. The ability to capture motion in everyday surroundings using low cost equipment is highly desirable for several reasons, having potential for home rehabilitation and performance analysis of local/non-elite athletes in more realistic settings.

2. Problem Statement
Low cost depth sensors such as the Microsoft Kinect provide 2.5D scene geometry, greatly simplifying the task of foreground extraction and coarse pose recognition, crucial elements in MoCap. However, they are susceptible to occlusions and are generally unsuitable for large outdoor environments. Inertial sensors offer the most promising solution for both indoor and outdoor tracking, though they must be coupled with slower, more accurate sensors/correction methods to avoid errors associated with drift. Additionally, as each body segment must have a sensor attached, this greatly increases the cost and intrusiveness of the system.

3. Related Work
Much research to date has focused on the use of cheap accelerometers or Inertial Measurement Units (IMUs) attached to the body to estimate the orientation of body segments and infer full body pose [1]. To reduce the number of wearable sensors required, some authors [2] have attempted to fill in the information gaps by cross-referencing a pre-captured database of high quality MoCap data. The major limitation of such an approach is that the accuracy of the reconstructed motion depends on the richness/suitability of the database to the motions performed. A database such as this is typically recorded once using MoCap equipment and extensively post-processed, resulting in a static database that cannot be updated.

4. Research Question
Could depth and inertial sensors be combined to create a low cost human motion capture and synthesis system that attains accuracy comparable to expensive professional MoCap systems?

5. Hypothesis
This research will demonstrate that the accuracy of pose estimation can be improved by fusing depth and inertial modalities for indoor environments. This motion capture system can in turn be used to create a personalized database of pre-captured motion, which can be used to synthesize motion from wearable sensors in more challenging environments.

6. Proposed Solution
Our low cost MoCap system is used to record a database of motions and construct a KD-tree for outdoor usage. The KD-tree is a data structure that partitions the database of available motions based on their similarity, facilitating efficient retrieval. During outdoor operation, for which depth sensors are not suitable, we rely on sparse inertial sensor data to query the KD-tree and synthesize full body pose. In our approach, we overcome the bottleneck in database creation by using a Microsoft Kinect to record the original full body motions and then subsequently enabling the system to dynamically update and expand the database when unique motions are identified.

7. Evaluation
To evaluate the accuracy of the system, the motion synthesis results are compared to a Vicon system2, a gold standard MoCap system. The accuracy of the system is presented in terms of the mean squared error for joint position. To date, using a database of Kinect motions and wearable accelerometers, the system can achieve a synthesis error of less than 14cm for the hands, and less than 5cm for joints closer to the body. This is comparable to Kinect skeleton tracking indoors, though the synthesis system is capable of working in both indoor and outdoor environments. Future research aims to reduce this error further using more sophisticated fusion techniques.

References
[1] D. Roetenberg. Inertial and magnetic sensing of human motion. University of Twente, 2006.
[2] R. Slyper and J. K. Hodgins. Action capture with accelerometers. In Proceedings of the 2008 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 193–199. Eurographics Association, 2008.

1 Acknowledgement - EU REVERIE FP7-ICT-287723
2 www.vicon.com
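The retrieval step in Section 6 amounts to a nearest-neighbour query against the pre-captured database. A minimal sketch using SciPy's KD-tree (the 3-D feature vectors and random poses are placeholders; real queries would use windows of inertial sensor data):

    import numpy as np
    from scipy.spatial import cKDTree

    database_features = np.random.rand(1000, 3)    # one row of inertial features per stored frame
    database_poses = np.random.rand(1000, 20, 3)   # corresponding full-body joint positions

    tree = cKDTree(database_features)              # partitions the motion database for fast lookup
    _, idx = tree.query(np.array([0.1, 0.5, 0.9])) # sparse sensor reading at run time
    synthesized_pose = database_poses[idx]         # retrieved full-body pose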


Obesity, Vaping & Referendums – Opinion Mining & Social Media Analysis Ayokunle Adeosun Supervisor: Dr. Adam Bermingham [email protected] [email protected]

Abstract
The purpose of this research is to understand what influences people's opinions and triggers their behaviors. We used social media as a medium to collect data on people's thoughts and opinions. To be able to do this effectively, we developed a methodology that allows us to collect large amounts of data. We studied what people were saying about a number of past referendums and what their thoughts were regarding fast-foods, fizzy drinks and e-cigarettes. We examined their tweets to see if there were patterns in the things they wrote.

1. Motivation
The motivation to carry out this research came from the fact that we want better ways to track political opinion and attitudes towards risky behaviors. Social media analysis is a good way to do this, as demonstrated in related work, but now we can do it bigger, better and faster than before.

2. Problem Statement
When researchers want to gather information from social media, they have to set up crawlers and use open APIs. This is a very slow and tedious method due to restrictions such as rate limits, access to historical data and lack of resources for multiple studies. This method would have to be repeated for every single topic they wished to study.

3. Related Work
There are several studies that have used social media to gather data, such as Kim Strandberg's "A social media revolution or just a case of history repeating itself? The use of social media in the 2011 Finnish parliamentary elections"1 — in it, he uses multiple social media sites to survey the 2011 Finnish election. Adam Bermingham also wrote a paper titled "On Using Twitter to Monitor Political Sentiment and Predict Election Results"2, in which he uses sentiment analysis and volume-based measures to capture the voting intentions of people. The effect of social media across multiple referendums has not been studied before; most similar works have usually focused on one election or referendum rather than multiple referendums.

4. Research Questions
The questions are: Can we develop a methodology that allows researchers to collect data on a large scale? What triggers risky behaviors such as vaping and eating fast-foods? What influences people's opinions on specific referendums? Can we understand people's lifestyles and attitudes through social media analysis?

5. Hypothesis
It is hypothesized that social media topics and opinions have an influence on people's decisions on real-world situations, and vice-versa. Our aim is to understand the key factors that contribute to people's opinions. These factors will include government policies, taxation, or advertising strategies carried out by businesses online or in the real world.

6. Proposed Solution
Using Datasift3 – an online platform – we were able to collect large amounts of historical data using powerful historical and location-based queries. There are several social media sources to choose from, such as Facebook, Twitter, YouTube, blogs and news. For this research, we used Twitter and were able to get over 5 million tweets on Irish and New Zealand based cohorts in two days. Datasift also provides a sentiment analysis score with the data. This can be used for understanding attitudes, trends and voting intentions. We got tweets on health prevention topics such as fizzy drinks, fast foods and vaping, and a corpus for nine different referendums.

7. Evaluation
We also looked at the possibility of predicting the results of referendums by studying these data. During the research, we studied the UK's Alternative Vote referendum and found more support for 'Yes' than 'No', but the results were the complete opposite. Our next step is to figure out why this was the case.

8. References
[1] Kim Strandberg, Department of Politics and Administration, Åbo Akademi University, Finland
[2] Bermingham, Adam and Smeaton, Alan F. (2011) On using Twitter to monitor political sentiment and predict election results. In: Sentiment Analysis where AI meets Psychology (SAAIP) Workshop at the International Joint Conference for Natural Language Processing (IJCNLP), 13th November 2011, Chiang Mai, Thailand.
[3] https://datasift.com/


Predicting Retweets Daniel-Emanuel Lal Insight Centre for Data Analytics E-mail [email protected]

Abstract
We report our work on predicting which tweets result in the highest number of retweets and favourites. We use a dataset from the ACM RecSys Challenge 2014, which contains tweets that the IMDb iOS app produces when a user rates a movie. We argue that a careful methodology and a number of insights about the dataset will allow us to achieve high prediction accuracy.

1 Introduction
Much recommender system evaluation focuses on measuring the accuracy of rating predictions or the precision of top-n recommendations [3]. The ACM RecSys Challenge 2014 explores a different evaluation measure: "Instead of a traditional evaluation predicting ratings or relevant items, the participants are tasked with predicting which items generate the highest user engagement, i.e. favourites and retweets." [2]. Specifically, the task is to predict retweets and favourites for tweets that the IMDb iOS app produces when a user rates a movie.
The organizers of the challenge have three datasets of tweets: a training set; a test set for use by participants during the competition; and an evaluation set, which is withheld from participants, for final evaluation of systems and selection of winners. For each tweet, the datasets include the IMDb id of the rated movie, the movie rating, the retweet count and the favourite count, among other things. The features we currently use are: the number of followers (which we cap at 2500), the tweet exposure time (between creation and scraping), and the movie rating. We discard tweets whose movie rating is out of range, and we use variance scaling on all three features.
The goal is to use the training set to build estimators that can rank the tweets in a test set in descending order of engagement, within user. Where two of a user's tweets are predicted to have the same engagement, they are ordered by tweet id. Evaluation of a user's predicted tweet ranking uses nDCG@10 [3], which is then averaged over all users. The nDCG@10 lies between 0 and 1, with 1 being best.

2 Method
It is tempting to: use the training set to build lots of different estimators; evaluate them on the test set; and select the one with the highest score. But this will most likely result in an estimator that overfits the test set and does not generalize well to the evaluation set. Instead, for error estimation, we use nested 10-fold cross-validation on the training set. The inner cross-validation performs model selection, i.e. it uses a grid search to find the best values for hyperparameters such as the number of trees in Random Forest estimators [1].

3 Key insights
Binary classification. Approximately 95% of the tweets in the training set have zero user engagement (no retweets or favourites). We calculated that an estimator that correctly predicts which of a user's tweets have zero engagement and simply orders all other tweets of that user by tweet id would have an nDCG@10 on the test set of 0.98. Hence, one of our lines of investigation is binary classification: simply predicting which tweets have zero and which have positive engagement.
Balanced training set. Since randomly-sampled training sets will contain so few examples with positive engagement, learning algorithms are likely to treat these examples as little more than noise. Hence, another of our lines of investigation is balanced training sets: we randomly resample the examples with positive engagement until the number of examples of each kind (zero and positive) is approximately equal.
Ensembles. Another of our lines of investigation focuses on ensembles of estimators, which are often more accurate than individual estimators [1].

4 Conclusions
We are continuing to work on this Challenge, following the lines of investigation that we have set out in this paper. At present, on the public test set we are not yet close to achieving perfect classification and therefore we fall short of an nDCG@10 of 0.98. We are trying different classifiers and different features in an effort to increase accuracy.1

References
[1] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning (2nd edition). Springer, 2009.
[2] A. Said, S. Dooms, B. Loni, and D. Tikk. ACM RecSys Challenge 2014. http://2014.recsyschallenge.com/.
[3] G. Shani and A. Gunawardana. Evaluating recommender systems. In F. Ricci et al., editors, Recommender Systems Handbook, pages 257–297. Springer, 2011.

1 This publication has emanated from research supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289, and supervised by Derek Bridge and Marius Kaminskas.
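For reference, a straightforward nDCG@10 in the sense described in Section 1: the items are one user's tweets in predicted order and the gains are their true engagement counts. The Challenge's official scorer may differ in details such as the gain function and the handling of all-zero users, so treat this only as a sketch.

    import math

    def ndcg_at_10(engagement_in_predicted_order):
        def dcg(values):
            return sum(v / math.log2(rank + 2) for rank, v in enumerate(values[:10]))
        ideal = dcg(sorted(engagement_in_predicted_order, reverse=True))
        # Users whose tweets all have zero engagement need a convention of their own;
        # 0.0 is used here arbitrarily.
        return dcg(engagement_in_predicted_order) / ideal if ideal > 0 else 0.0

    print(ndcg_at_10([0, 3, 0, 1]))   # one user's tweets, ranked by the estimator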


A scalable adaptive method for complex reasoning over streams Thu-Le Pham, Alessandra Mileo National University of Ireland Galway {thule.pham, alessandra.mileo}@insight-centre.org

Abstract
Data streams are infinite sequences of data elements generated by a number of sources at high rate. Current stream processing solutions can handle streams of data and produce new results in a timely manner, but they lack complex reasoning capabilities. Conversely, engines that can perform such complex reasoning tasks are mostly designed to work on static data. In this proposal, we tackle the challenge of bridging the gap between stream processing and complex reasoning for more scalable solutions.

1. Motivation
Nowadays, we are witnessing an increase in the quantity and quality of data coming from the Internet and from sensors. Data streams are used by a variety of modern applications such as environment monitoring, traffic management and health assessment. These applications face many difficult challenges: they have to deal with massive, ordered, incomplete, heterogeneous, and noisy data, and they must perform inference to expose the implicit knowledge behind that data. Additionally, response time is strictly limited in such applications. Current reasoners cannot deal with the issues mentioned above. Therefore, there is a clear need for the design and implementation of complex and scalable reasoning methods over data streams.

2. Problem Statement
Most real-time applications demand some form of reasoning with varying complexity. An open challenge is how to identify the right trade-off between reasoning capabilities and expressivity. We intend to explore a new scalable adaptive stream reasoning approach for detecting relevant events in streaming data and suggesting appropriate changes in the stream reasoning process.

3. Related Work
There are various existing approaches aiming to perform reasoning over data streams [2]. In stream processing, the existing solutions are divided into two categories: (1) Data Stream Management Systems and (2) Complex Event Processing [3]. The former category includes well-known engines such as CQELS and C-SPARQL, which can continuously process low-level data streams at high rate. The latter considers observable raw data as primitive events and expresses composite events through specific operators. These approaches do not manage the uncertainty associated with the data and do not perform complex reasoning tasks.
In the knowledge representation and reasoning research area, the state of the art in reasoning over changing data is temporal logic and belief revision. Recent attempts focus on extending the well-established declarative complex reasoning framework Answer Set Programming (ASP) with dynamic data. However, these approaches still mainly operate on slowly changing data and relatively small data sizes.

4. Research Question & Hypothesis
In this research, we intend to focus on enriching the ability to reason over data streams. The main question can be refined to "How can a system perform scalable complex reasoning over streams?". This question can be addressed by defining a suitable reasoning model that can (i) capture the dynamic properties of data over time, (ii) provide a mechanism to extract high-level knowledge, (iii) react to new and expired information, and (iv) suggest appropriate adaptations of the stream reasoning process to enhance scalability.

5. Proposed Solution
In order to answer the above research question, we plan to combine the benefits of stream processing and reasoning by performing a formal analysis of the synergies between query processing engines and logical reasoning methods. Stream processing engines can reduce the enormous volume of data via query pattern matching; this reduces the complexity faced by the reasoner. Based on this intuition, we will design a rule layer that provides declarative and scalable complex problem solving capabilities. This will be made possible by i) defining a uniform formal semantics for stream processing and complex reasoning by exploiting the relationships between SPARQL 1.1 and ASP, ii) implementing a multi-logic rule layer that extends the formalism of reactive Multi-Context Systems [1], and iii) setting up a benchmark for trade-off analysis between complexity and scalability.

References
[1] M. Dao-Tran, T. Eiter, M. Fink, and T. Krennwallner. Distributed nonmonotonic multi-context systems. KR, 10:60–70, 2010.
[2] E. Della Valle, S. Schlobach, M. Krötzsch, A. Bozzon, S. Ceri, and I. Horrocks. Order matters! Harnessing a world of orderings for reasoning over massive data. Semantic Web, 4(2):219–231, 2013.
[3] A. Margara, J. Urbani, F. van Harmelen, and H. Bal. Streaming the web: Reasoning over dynamic data. Web Semantics: Science, Services and Agents on the World Wide Web, 25:24–44, 2014.
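To make the intuition in Section 5 concrete, the sketch below is a minimal Python illustration of the proposed division of labour, not the authors' implementation and not C-SPARQL, CQELS or ASP: a sliding window plus a pattern filter stands in for the stream processing stage, and a toy forward rule stands in for the complex reasoning layer. All names (SlidingWindow, the sensor readings, the "overheating" rule) are invented for illustration.

from collections import deque
from time import time

class SlidingWindow:
    """Keeps only the events that arrived within the last `span` seconds.
    This plays the role of the stream processing stage: it bounds the
    amount of data the reasoner ever has to look at and drops expired facts."""
    def __init__(self, span_seconds):
        self.span = span_seconds
        self.events = deque()

    def push(self, event, now=None):
        now = now if now is not None else time()
        self.events.append((now, event))
        while self.events and now - self.events[0][0] > self.span:
            self.events.popleft()          # react to expired information
        return [e for _, e in self.events]

def select(events, pattern):
    """Query-style pattern matching: keep only events whose fields match
    the pattern, shrinking the input handed to the rule layer."""
    return [e for e in events
            if all(e.get(k) == v for k, v in pattern.items())]

def rule_layer(facts):
    """A toy rule standing in for a declarative reasoner:
    IF two distinct sensors in the same room report 'high' within the window
    THEN derive an alert for that room."""
    rooms = {}
    for f in facts:
        rooms.setdefault(f["room"], set()).add(f["sensor"])
    return [{"alert": "overheating", "room": r}
            for r, sensors in rooms.items() if len(sensors) >= 2]

# Illustrative usage on a hand-made stream of sensor readings.
window = SlidingWindow(span_seconds=60)
stream = [
    {"sensor": "s1", "room": "lab", "level": "high"},
    {"sensor": "s2", "room": "lab", "level": "high"},
    {"sensor": "s3", "room": "office", "level": "low"},
]
for reading in stream:
    current = window.push(reading)
    candidates = select(current, {"level": "high"})  # stream processing stage
    print(rule_layer(candidates))                    # reasoning stage

The point of the sketch is only the pipeline shape: the window and the pattern filter cut the volume of raw data before any reasoning happens, which is the scalability lever the proposal builds on.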

INSIGHT-SC [73]

The Stable Marriage Problem Aodhagán Murphy 2nd Year UCD B.Sc. in CSI student Insight Centre for Data Analytics, University College Cork E-mail [email protected]

Abstract
A solution to an extended Stable Marriage Problem that provides a stable matching of lecturers to modules using the combinatorial optimisation platform Numberjack.

1. Introduction
The Stable Marriage Problem is a classic problem of combinatorics and matching under preferences. It has been the subject of much interest in the Game Theory, Economics and Operations Research communities. There are n men and n women, each with a list of preferences over the other set; the goal is to match every man with a woman according to these preferences and to obtain a stable solution in which nobody can switch to a better matching. From this basic problem, several variants have been studied, for example the "college admission problem" or the "sex-equal stable marriage".
The main motivation for this research on stable marriages is to resolve assignments of lecturers to teaching modules in such a way that preferences are respected and the matching is optimal.

2. Problem Statement
A stable matching is obtained when there is no unstable (blocking) pair. A blocking pair is one in which a man A prefers a woman B to his currently matched partner and the woman B prefers man A to her currently matched partner. Figure 1 shows an example of a blocking pair: W1 prefers M3 to her currently matched partner M1, and M3 prefers W1 to his currently matched partner W3, so both can change for the better.

Figure 1: An example of an unstable matching

Our extension of this problem is to provide a stable matching of n lecturers to m modules. Preference lists may be incomplete and may contain ties. Each module is weighted, and the lecturers should be given the same workload or, failing that, workloads as close to equal as possible. The objective is to obtain an optimal solution from a sex-equal point of view, i.e. one in which the two sides of the matching are treated with parity.

3. Proposed Solution
We are using constraint programming to solve this problem. Constraint programming is a type of declarative programming in which the relations between variables are described by a set of constraints. A CSP Gale-Shapley approach to solving the classic variants of the Stable Marriage Problem has been described in the literature [1].
Using nogood constraints, our CSP model is given a set of n conflict matrices of size m. These conflict matrices define the pairs that are blocked as a result of a currently matched pair. The model is then extended with further constraints, for instance that only one lecturer can be matched to a specific module, and that a lecturer cannot exceed their quota, i.e. the sum of the weights of their modules.
Starting from this first CSP model, we are looking for a new CSP model for the extended stable marriage problem with a sex-equal matching.

4. Evaluation
We have used Numberjack for other, simpler problems. Numberjack is a modelling package written in Python for constraint programming. The Gale-Shapley algorithm is probably the best-known solution to the stable marriage problem and its variants; however, it is interesting to provide a constraint programming approach to one of its extensions, since constraint programming has been less broadly explored for this problem, and using it in a real-world solution demonstrates its practicality and performance. We will compare the different CSP models that we implement and compare all the results according to the parameters.
Insight@UCC are studying a specific stable marriage problem with capacities, in which someone from one set (for example the men) can change side (moving to the women's set). Using the solution proposed for our problem, we will try to extend this result to that new version.

References
[1] I. P. Gent, R. W. Irving, D. Manlove, P. Prosser, and B. M. Smith. A constraint programming approach to the stable marriage problem. In CP, pages 225–239, 2001.

Acknowledgement
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.
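For reference, since the abstract contrasts its CSP approach with the classical Gale-Shapley algorithm, the following is a minimal Python implementation of man-proposing Gale-Shapley on a toy instance. It is illustrative only and is not the CSP/Numberjack model described above; the preference lists are invented.

def gale_shapley(men_prefs, women_prefs):
    """Man-proposing Gale-Shapley. Both inputs map a person to an ordered
    list of the other side, most preferred first. Returns a dict
    {man: woman} containing no blocking pair."""
    # rank[w][m] = position of m in w's list (lower means preferred)
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    free = list(men_prefs)                  # men not yet engaged
    next_proposal = {m: 0 for m in men_prefs}
    engaged_to = {}                         # woman -> man

    while free:
        m = free.pop()
        w = men_prefs[m][next_proposal[m]]  # best woman m has not yet tried
        next_proposal[m] += 1
        if w not in engaged_to:
            engaged_to[w] = m
        elif rank[w][m] < rank[w][engaged_to[w]]:
            free.append(engaged_to[w])      # w trades up; her old partner is free
            engaged_to[w] = m
        else:
            free.append(m)                  # w rejects m
    return {m: w for w, m in engaged_to.items()}

# Toy instance: 3 men, 3 women, complete strict preference lists.
men = {"M1": ["W2", "W1", "W3"],
       "M2": ["W1", "W2", "W3"],
       "M3": ["W1", "W3", "W2"]}
women = {"W1": ["M3", "M1", "M2"],
         "W2": ["M1", "M2", "M3"],
         "W3": ["M2", "M3", "M1"]}
print(gale_shapley(men, women))  # {'M3': 'W1', 'M1': 'W2', 'M2': 'W3'}

The CSP model sketched in Section 3 replaces this procedural search with stability expressed directly as constraints (nogoods over blocking pairs), which makes it easier to layer on side constraints such as module quotas, incomplete lists and ties.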

INSIGHT-SC [74]

Towards Automatic Activity Classification During a Sports Training Session Edmond Mitchell, Noel E. O’Connor Insight Centre for Data Analytics [email protected]

Abstract
Motion analysis technologies have been widely used to enhance athlete performance. However, most of these technologies are expensive, can only be used in laboratory environments, and examine only a few trials of each movement action. A system is presented that automatically classifies a large range of training activities using the Discrete Wavelet Transform (DWT) and a classifier. A number of different classifiers are investigated, with the random forest classifier achieving the highest overall accuracy of 98%.

1. Motivation
Sport and physical activity have important cardiovascular, musculoskeletal and mental health benefits and are enjoyed by a large number of people. Technology can help people improve their ability to train, thereby improving their wellbeing. The ability to automatically identify the different activities performed during a training session has many benefits. For amateur athletes, this information would allow them to evaluate their performance across training sessions. Professional athletes and their coaches would be able to use the data to ensure the athlete is making sufficient progress towards their training goals.

2. Problem Statement
Wearable Inertial Measurement Units (WIMUs) are capable of tracking rotational and translational movements and are gaining popularity for monitoring human movement in a number of sports training applications [1]. This allows a subject's activities to be continuously monitored outside clinical environments. With the recent development of more accurate and relatively cheap WIMUs, it has become feasible to deploy wearable body sensor networks in training sessions. In this work, features are extracted from these WIMUs and fed into different classifiers.

3. Related Work
Much of the prior research in activity classification has dealt with identifying mundane tasks such as eating, ascending and descending stairs, sitting and brushing teeth, as well as motion activities such as being stationary, walking, running, training exercises and sports activities.

4. Research Question
What feature and classifier combination creates the most accurate classification models from WIMU data?

5. Proposed Solution
In this work, four different classifiers were investigated in order to create the most accurate classification system. The classifiers employed were lazy IBk, RBF Network, Naive Bayes and Random Forest. The DWT has been used with much success to extract discriminative features from accelerometer data as the basis for classification. In developing our approach to activity classification, the exercise routine performed by each athlete was segmented and annotated for all activities and used to create a training set. A window length of three seconds was chosen, as this was sufficient time for each of the selected training activities to be completed. These activities can be seen in Figure 1.

6. Methodology
EDRA represents the energy ratio of the DWT approximation coefficients, while EDRDj represents the energy ratio of the DWT detail coefficients. The normalised variances of the DWT decomposition coefficients and the EDRs provided the most informative features. The variances of the coefficients are calculated over each DWT coefficient vector at the ith level. These features are then fed into the different classifiers in order to assess each classifier's ability.

7. Evaluation
Figure 1 gives the classification scores for each activity for each of the classifiers. It can be observed that the random forest classifier performs best, with an overall accuracy of 98%.

Figure 1: Comparison of Classifiers

References
[1] H. Ghasemzadeh, V. Loseu, E. Guenterberg, and R. Jafari. Sport training using body sensor networks. In Proceedings of the Fourth International Conference on Body Area Networks, page 2, 2009.
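As an illustration of the feature pipeline described in Sections 5 and 6 (a sketch under our own assumptions, not the authors' exact configuration), the Python fragment below computes per-window DWT energy ratios and normalised coefficient variances with PyWavelets and trains a scikit-learn random forest. The wavelet choice ('db4'), the decomposition level, the synthetic windows and the use of scikit-learn in place of the Weka classifiers named in the abstract are all placeholders.

import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def dwt_features(window, wavelet="db4", level=4):
    """One feature vector per 3-second window of one sensor axis:
    the energy ratio of the approximation band (EDRA), the energy ratios
    of the detail bands (EDRDj), and the normalised variance of each band."""
    coeffs = pywt.wavedec(window, wavelet, level=level)  # [cA_L, cD_L, ..., cD_1]
    energies = np.array([np.sum(c ** 2) for c in coeffs])
    edr = energies / energies.sum()                      # energy distribution ratios
    variances = np.array([np.var(c) for c in coeffs])
    variances = variances / variances.sum()              # normalised variances
    return np.concatenate([edr, variances])

# Synthetic stand-in for segmented, annotated accelerometer windows:
# 3 s at 100 Hz, two fake "activities" with different dominant frequencies.
rng = np.random.default_rng(0)
t = np.arange(300) / 100.0
def fake_window(freq):
    return np.sin(2 * np.pi * freq * t) + 0.3 * rng.standard_normal(t.size)

windows = [fake_window(1.0) for _ in range(40)] + [fake_window(5.0) for _ in range(40)]
labels = [0] * 40 + [1] * 40

X = np.vstack([dwt_features(w) for w in windows])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.score(X, labels))   # training accuracy on the toy data

In a real deployment the windows would come from the segmented WIMU recordings, one feature vector per axis per window, and accuracy would be reported on held-out sessions rather than on the training data.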

INSIGHT-SC [75]

Visualizing Geographic Ride-Sharing Data Yves Sohege 2nd Year BSc in Computer Science (INTERN) Insight Centre for Data Analytics, University College Cork [email protected]

Abstract What is the best way to visualize ride-sharing data for users? We are exploring different ways to accomplish this using map-based interfaces to encourage people to use ride- sharing on a daily basis.

1. Introduction
Ride-sharing has many benefits. It is important for improving air quality, reducing congestion and reducing carbon emissions by reducing the number of vehicles on the road. The problem with ride-sharing is that it only works if enough people participate. To attract participants to the scheme, we need a way of showing them what rides are available. The ride-sharing data is complex, depending on time and location, with regular trips and one-off trips. This abstract describes an undergraduate internship project on visualizing ride-sharing data.¹

2. Data and Problem
The source of the data that we are visualizing is the Carma API, which was made available to us. Carma [1] is a company that provides a ride-sharing service to users. The data includes geographic latitude and longitude locations for the origin and destination, and the time of departure. The research questions we are interested in include: (i) What are good ways to visualize ride-sharing data for potential participants? (ii) How do you display complex information in an easy-to-understand way?
We are exploring a number of different approaches, starting with simply displaying individual routes on a map and allowing users to change dates and locations. However, rather than focus on individual routes, we believe prospective users are more interested in seeing a spread of options. Two recent ideas are promising, and are discussed below.

3. Solution
The first approach answers the question "Where can I get to from here?". We start by overlaying a grid on the map. When a user clicks on a grid square, all routes originating in that area for a specified time period are retrieved from the database, and the grid squares containing the end-points of these routes are highlighted. The method can be varied to show the number of routes to any grid square and handles different zoom levels (Figure 1).

Figure 1: The Grid Search interface. Blue represents the origin and red the destination.

Although the grid method is easy to understand, the grid squares do not necessarily match the area the user wants to start from. We have developed a second approach that allows the user to draw a polygon on the map marking where he/she wishes to start from. The app then searches for all routes originating within the polygon. This is implemented by finding the centroid of the polygon, creating the smallest circle centred there that contains all the polygon points, retrieving all routes within that circle from the database, and then filtering out any routes that start outside the polygon (Figure 2).

Figure 2: The Polygon Search interface. The blue polygon is the area that will be searched for trips.

4. Evaluation
The two approaches are being evaluated through user feedback. This feedback will then be incorporated to improve subsequent solutions.

5. References
[1] Carma Website: https://carmacarpool.com

¹ This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.
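To illustrate the polygon-search procedure described in Section 3, a minimal Python sketch using Shapely is shown below. It is not the project's code: fetch_routes_within stands in for whatever database or Carma API call returns routes originating inside a circle, and all coordinates are invented.

from shapely.geometry import Point, Polygon

def polygon_search(polygon_coords, fetch_routes_within):
    """Return the routes that originate inside the user-drawn polygon.
    `fetch_routes_within(center, radius)` is assumed to query the route
    store for all routes whose origin lies inside the given circle."""
    polygon = Polygon(polygon_coords)
    center = polygon.centroid
    # Smallest circle centred on the centroid that contains every vertex.
    radius = max(center.distance(Point(c)) for c in polygon.exterior.coords)
    candidates = fetch_routes_within((center.x, center.y), radius)
    # The circle over-approximates the polygon, so filter the extras out.
    return [r for r in candidates
            if polygon.contains(Point(r["origin_lon"], r["origin_lat"]))]

# Invented stand-in for the database/API call, for demonstration only.
def fake_fetch(center, radius):
    return [{"id": 1, "origin_lon": -8.47, "origin_lat": 51.90},
            {"id": 2, "origin_lon": -8.60, "origin_lat": 51.95}]

area = [(-8.50, 51.88), (-8.45, 51.88), (-8.45, 51.92), (-8.50, 51.92)]
print(polygon_search(area, fake_fetch))  # only the route starting inside `area`

The circle query is only a coarse pre-filter that keeps the database call simple; the exact point-in-polygon test is applied afterwards on the much smaller candidate set.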

INSIGHT-SC [76]

List of Keywords

Accelerometer, 13, 58 Data Mining, 15, 42, 49 Access Decision, 24 Data Quality, 46 Ad-hoc Networking, 69 Deep Learning, 61 Affect Recognition, 38 Degradation Analysis, 34 Algorithm Configuration, 48 Dementia, 22 Ambient Assisted Living, 22 Depth Camera, 58 Ambient Intelligence, 22 Dimensionality Reduction, 47 Ambient Technology, 22 Distributional Semantics, 56 Analytics, 59 Diversity, 63 API Usage, 76 Droplets, 52 Association Rule, 42 DSS, 17 Asymmetry,5 DWARF, 17 Automatic Algorithm Configuration, 48 DWT, 75 Dynamic Topic Model, 30 Bayesian Inference,8, 45 Bayesian Statistic, 31 e-Government, 27 Big Data, 68 e-Participation, 27 Bilingual Terminology, 29 Econometrics, 15 Biofeedback, 13 Education, 59, 64 Biomechanic Screening, 58 Electrochemical Sensor, 40 Biometrics, 18 Electroencephalography, 41 Brain-Computer Interface, 41 Energy Awareness, 19 Energy Efficiency, 19 Cardiorespiratory Fitness, 12 Energy Management, 50 Cardiovascular Disease, 12 Entity Linking, 20 Cassandra, 17 Entity Recognition, 20 Categorical Data Clustering, 60 Environmental Monitoring, 32 Cauchy Projection, 47 Ethics, 22 Causal Model, 68 Evaluation, 63 Chemical Analyzer, 32 Event Pattern,7 Chronic Ankle Instability, 16 Event Processing, 54 Citation, 67 Exercise, 66 Classification,6 Explanation Interface, 37 Clinical Data Mining, 61 Exponential Random Graph Model,8 Clustering, 53 Community Finding,4 Face Detection, 39 Complex Event Processing,7 Face Recognition, 39 Complex Network, 45 Facebook, 10 Composite Likelihood,8 Fatigue, 16 Computational Linguistics, 21 Feature,6 Computer Vision, 11 Federated Query,1 Constraint, 74 Federation, 24 Content-based Recommendation,3 Forestry, 14 Continuous Analysis,5 Continuous Annotation, 38 Gaussian Process, 34 Crowdsourcing,2 General Sampling,9 Customer Review, 21 GrabCut Algorithm, 41 Graph, 25 Data Analytics, 10, 15, 64 Groin Pain,5 Data Center, 44 Gyroscope, 13, 58 Book of Abstracts Insight Student Conference (INSIGHT-SC 2014), Dublin, Ireland, September 12, 2014

Heterogeneous Information Network, 53 Multidimensional Database, 35 Heuristics, 25 Multimodal, 38, 43 Human Computer Interaction, 18 Musculoskeletal Pain Classification, 60 HVAC Maintenance, 34 Music Classification, 23 Hydrogel, 51 Music Information Retrieval, 23

Image Processing, 11 Natural Language Understanding, 20 IMU, 13, 58 Neural Network Language Models, 56 Incremental Vector Space Construction, 47 News Tracking, 28 Inertial Sensor, 13, 16, 58, 70 NoSQL, 17 Information Extraction, 62 Nutrient, 32 Informed Consent, 22 Interactive Feedback, 19 Object Segmentation, 41 Interactive Segmentation, 41 Occupant Location Prediction, 42 Internet of Things, 54 OLAP, 17 Intractable Likelihood, 31 Online Advertising, 10 Ising Model, 31 Online News Recommender, 55 Item Representation,3 Online User Behavior, 30 Open Data, 50 Job Recommendation,3 Opinion Mining, 26, 37, 71 Optical Network, 25 Kellegren-Lawrence Grading,6 Optimisation, 44, 76 Keyword Sense Disambiguation, 65 Oracle Inequality,9 Kinect, 13, 58, 66 Osteoarthritis,6

L1 Norm, 47 PAC-Bayesian Bound,9 Label Propagation,4 Parallel Algorithms,4 Latent Class Analysis, 60 Particle Filter, 34 Latent Space Model, 45 Periodicity, 43 Learning, 59 Personal Sensing, 13, 58 Lifelogging, 35, 43 Personalized Newspaper, 55 Linked Data,1, 46 Photo-control, 51 Linked Stream Data,1 Portfolio Solver, 33 Load Balancing, 44 Predictive Analysis, 50 Local Search, 25 Privacy Protection, 39 Longitudinal, 43 Quality Estimation, 36 Quality of Service,7 Machine Learning, 11, 33, 46, 49, 61, 75 Magnetometer, 13, 58 Random Projection, 47 MARG, 13 Ranking, 53 Markov Chain Monte Carlo, 31 Rapid Serial Visual Presentation, 41 Matching, 74 RDF,1 Matrix Completion,9 RDF Data Warehouse, 36 Media Analytics, 62 Real Time Learning, 48 Metropolis-Hasting,9 Real Time Protocols, 69 Micro-vehicle, 52 Reasoning, 73 Microblog Retrieval, 56 Recommender System, 11, 23, 26, 35, 37, 55, 63, 72 Microfluidics, 32, 40, 51, 52 Rehabilitation, 58, 66 Moodle, 64 Representation Learning, 56, 61 Motion Capture, 66, 70 Research Impact, 67 Motion Synthesis, 70 Resistance Training, 13 Mouse Dynamics, 18 Retweet Prediction, 72 Movement Variability, 16 Multi-armed Bandit,2 SAT Problem, 33 Book of Abstracts Insight Student Conference (INSIGHT-SC 2014), Dublin, Ireland, September 12, 2014

Scale-free Network, 45 Strength and Conditioning, 13, 58 Search, 57 Summarization, 28 Self-Propelled, 52 Sweat Monitoring, 40 Semantic Matching, 54 Semi Online Algorithm, 44 Task Assignment,2 Sentiment Analysis, 21, 26 Term Extraction, 29 Serious Game, 66 Text Processing, 20 Service Computing,7 Topic Tracking, 56 Shimmer, 13, 58 Trust Management, 24 SIFT, 57 Tweet Classification, 65 Similarity Function,3 Twitter, 28, 49, 71 Small-world Network, 45 Twitter Noise Filtering, 65 Smart Home, 19 Social Media, 27, 28, 49, 71 Uncertainty, 14 Social Media Analysis, 53 User Lifecycle, 30 Social Network Analysis, 45 Social Web, 67 Variable Selection, 60 Source Selection, 50 View Materialization, 36 Spamming, 46 Visual Lifelogging, 39 SPARQL,1 Visual Object Retrieval, 57 SPARQL Live Querying, 36 Visualization, 76 Spiropyran, 51 Spontaneous Affect Dataset, 38 Water Quality, 32 Stable Marriage, 74 Wearable Technology, 13, 58 Statistical Machine Translation, 29 Wireless Communication, 69 Stochastic Programming, 14 Wireless Device, 40 Stream Processing, 73 Stream Reasoning, 73 Youth, 12