Second European Conference on Data Analysis

July 2 – 4, 2014

Program & Abstracts

Jacobs University

Held under the patronage of Ingo Kramer, President of the Bundesvereinigung Deutscher Arbeitgeberverbände (BDA), the German Employers Association

Sponsors

ECDA 2014 has been supported by

Many thanks!

First Announcement and Invitation to Participate

Third European Conference on Data Analysis 2015

ECDA2015

Joint conference of the British, German, and Polish Classification Societies!

University of Essex, Colchester, UK
2 - 4 September 2015

Conference organiser: Professor Berthold Lausen, Department of Mathematical Sciences, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, UK; email: [email protected]


Scientific Program Committee

Scientific Program Chair: Hans Kestler, Ulm University, Germany

Statistics and Data Analysis
• Francesco Mola, University of Cagliari, Italy
• Claus Weihs, Technical University Dortmund, Germany
• Roberto Rocci, University of Rome Tor Vergata, Italy
• Christian Hennig, University College London, UK

Machine Learning and Knowledge Discovery
• Eyke Hüllermeier, Philipps University of Marburg, Germany
• Friedhelm Schwenker, Ulm University, Germany
• Myra Spiliopoulou, Otto-von-Guericke University Magdeburg, Germany

Data Analysis in Marketing
• Jozef Pociecha, Cracow University of Economics, Poland
• Daniel Baier, Brandenburg Technical University Cottbus, Germany
• Wolfgang Gaul, Karlsruhe Institute of Technology, Germany
• Reinhold Decker, Bielefeld University, Germany

Data Analysis in Finance and Economics
• Marlene Müller, Beuth University of Applied Sciences Berlin, Germany
• Gregor Dorfleitner, University of Regensburg, Germany
• Colin Vance, RWI Essen, Germany

Data Analysis in Medicine and the Life Sciences
• Matthias Schmid, Ludwig-Maximilians-University Munich, Germany
• Iris Pigeot, Leibniz Institute for Prevention Research and Epidemiology, Bremen, Germany
• Berthold Lausen, University of Essex, UK

Data Analysis in the Social, Behavioural and Health Care Sciences
• Ali Ünlü, Technical University Munich, Germany
• Ingo Rohlfing, Jacobs University Bremen, Germany
• Karin Wolf-Ostermann, University of Bremen, Germany
• Jeroen K. Vermunt, Tilburg University, The Netherlands

Data Analysis in Interdisciplinary Domains
• Adalbert Wilhelm, Jacobs University Bremen, Germany
• Patrick Groenen, Erasmus University Rotterdam, The Netherlands
• Sabine Krolak-Schwerdt, University of Luxembourg, Luxembourg
• Frank Scholze, Karlsruhe Institute of Technology, Germany
• Andreas Geyer-Schulz, Karlsruhe Institute of Technology, Germany

LIS’2014
• Frank Scholze, KIT Karlsruhe, Germany (Chair)
• Ewald Brahms, Univ. Hildesheim, Germany
• Andreas Geyer-Schulz, KIT Karlsruhe, Germany
• Stefan Gradmann, KU Leuven, Belgium
• Hans-Joachim Hermes, TU Chemnitz, Germany
• Monika Lösse, German National Library Leipzig, Germany
• Bernd Lorenz, FHV Munich, Germany
• Michael Mönnich, KIT Karlsruhe, Germany
• Sylvia van Peteghem, Univ. Ghent, Belgium
• Magnus Pfeffer, HdM Stuttgart, Germany
• Heidrun Wiesenmüller, HdM Stuttgart, Germany

Preface

Welcome Note by the Dean of Graduate Studies and the Focus Diversity in Modern Societies at Jacobs University Bremen

Dear colleagues,

It is with great pleasure that I welcome you, the participants of the Second European Conference on Data Analysis, to Jacobs University. Let me state that I personally have a great affinity to data analysis, relating on the one hand to my training in experimental psychology and psychophysiology many years ago, and on the other hand to recent collaborations with colleagues in physics, mathematics, and engineering that involved very large data sets, machine learning approaches, modeling approaches, and complex visualizations. Much has happened in the last decades. And yet, it is a fact that “classic training” in data analysis in many academic fields is often based on methods that are 50 years old or more. Clearly, there is a need to feed new developments in data analysis back into training, not only at the graduate but also at the undergraduate level. Thus, there is also a strong institutional reason why I am happy to have you here: one of the hallmarks of the training of our BA students has always been a strong, contemporary, and interdisciplinary focus on data analysis and statistics, with a broad curriculum that applies across the board of disciplines. As we are currently reforming all BA and BSc programs, we are further pushing for a broad education in dealing with data, regardless of the major students are enrolled in. All Jacobs students will have strong exposure to the methods and reasoning involved in data analysis, whether they pursue a career in science or in business. In fact, we are particularly interested in bridging the divide between basic research and applications in the real world, which, I understand from your program, is also one of the concerns of the present meeting. It is no coincidence that data analysis is central to all three of the new transdisciplinary foci that will guide research and teaching at Jacobs University in the future: Diversity – in modern societies, Health

– focus on bioactive substances, and Mobility – of people, goods and information. Thus, it is particularly fitting to have you here. Please leave some of the vibes of the meeting within our walls and we shall focus them on our students. I wish you a productive and inspiring meeting on our campus.

Bremen, June 2014
Arvid Kappas, PhD
Professor of Psychology and Dean

Preface

Dear participants,

We extend a cordial welcome to all of you and wish you a productive, successful, and enjoyable Second European Conference on Data Analysis 2014. This second edition of the European Conference on Data Analysis is held under the patronage of Ingo Kramer, President of the Bundesvereinigung Deutscher Arbeitgeberverbände (BDA), the German Employers Association. This conference also marks the 38th anniversary of the German Classification Society (GfKl). This year’s conference has been organised in co-operation with the Italian Statistical Society Classification and Data Analysis Group (SIS-Cladag), the Vereniging voor Ordinatie en Classificatie (VOC), the Sekcja Klasyfikacji i Analizy Danych PTS (SKAD), and the International Association for Statistical Computing (IASC).

“Data Scientist is the sexiest job in the 21st century.” This quote from the Harvard Business Review clearly underlines the importance of data analysis, classification, statistical modelling, machine learning, and knowledge discovery: topics that are at the heart of the classification societies and the European Conference on Data Analysis. The speed at which data acquisition and data collection grow is extraordinary and fuels the demand for ever faster, more flexible, and more complex analysis techniques.

As in previous years, the conference topics are focussed in seven thematic fields:
• Statistics and Data Analysis
• Machine Learning and Knowledge Discovery
• Data Analysis in Marketing
• Data Analysis in Finance and Economics
• Data Analysis in Medicine and the Life Sciences
• Data Analysis in the Social, Behavioural, and Health Care Sciences
• Data Analysis in Interdisciplinary Domains.

The Workshop on Classification and Subject Indexing in Library and Information Science (LIS’2014) covers an additional theme of utmost relevance for the classification societies. The thematic fields do not provide a clear classification and a strong separation of the various contributions.
Quite the contrary: there is a lot of overlap between the areas, and common methodological questions and approaches permeate all thematic fields. The split into thematic areas is thus intended to show the breadth of topics, not a division into isolated subcategories. In particular, our keynote presentations (two plenary, eight semi-plenary) will address issues that cross typical disciplinary boundaries and foster links and interactions between the thematic areas. We are grateful to the members of the program committee who suggested the keynote speakers and helped to convince them to partake in ECDA 2014. We are happy to welcome as keynote speakers

• Themis Palpanas, Université Paris-Descartes: Do It Yourself: Exploratory Analysis of Very Large Scientific Data
• Adam Sagan, Cracow University of Economics: Latent Variables and Marketing Theory – the Paradigm Shift
• Rebecca Nugent, Carnegie Mellon University: Solving the Identity Crisis: Large-Scale Clustering with Distributions of Distances and Applications in Record Linkage
• Zhi-Hua Zhou, Nanjing University: Learning with Big Data by Incremental Optimization of Performance Measures
• Roberta Siciliano, University Federico II, Naples: The Speed-Interpretability-Accuracy Trade-Off in Decision Tree Learning
• Alfred Inselberg, Tel Aviv University: Visualization and Data Mining for High-Dimensional Data
• Pedro Pereira Rodrigues, University of Porto: How uncertainty simultaneously drives and hinders knowledge discovery from health data
• Andreas Geyer-Schulz, KIT: On a Decision Maker without Preference
• Joaquin Vanschoren, Eindhoven University of Technology: OpenML: Open science in machine learning
• Friedhelm Schwenker, University of Ulm: Partially supervised learning algorithms in classification and clustering.

We thank the Scientific Program Committee for soliciting contributions, for reviewing them in a timely manner, and for arranging a program consisting of 155 presentations, including six organised invited sessions. We are particularly grateful to those colleagues who organised these sessions by gathering a group of experts for an exchange on the state of the art and recent trends in a specific research topic.
At ECDA 2014 the following invited sessions are hosted:
• High-dimensional data: using all dimensions, organised by Christian Hennig
• Recent Advances in Mixture Modelling, organised by Roberto Rocci
• Data analysis in human-computer interaction scenarios, organised by Friedhelm Schwenker
• Health technology assessment of community interventions for active healthy ageing, organised by Helmut Hildebrandt and Berthold Lausen
• SVM Large Scale Learning, organised by Claus Weihs
• Predictions with classification models, organised by Jozef Pociecha.

Organising such a conference requires the support and cooperation of many people. It lives from the effort of dedicated colleagues on the organising and scientific program committees. We would like to thank the Area Chairs for their work in conference advertisement, author recruitment, and the evaluation of submissions. We are very grateful for the continuous support of Bianca Bergmann at Jacobs University in preparing the event, coordinating the different administrative units within the university, and communicating with our external partners. We also thank the team of student assistants supporting us. We would further like to express our thanks to the German Research Foundation (DFG), the Bremen International Graduate School of Social Sciences (BIGSSS), Jacobs University, and all our sponsors for supporting the conference. We wish you an inspiring conference and a pleasant stay in Bremen!

Bremen, Ulm, Colchester, June 2014
Adalbert Wilhelm
Hans Kestler
Berthold Lausen

Best Paper Award ECDA 2013

The GfKl Award Jury, consisting of Eyke Hüllermeier, Sabine Krolak-Schwerdt, Myra Spiliopoulou, Claus Weihs, and Berthold Lausen (ex officio), has nominated Ronny Scherer, University of Oslo, and Daniel Stoller, TU Dortmund, for the ECDA 2013 Best Paper Awards. The awards will be handed over during the opening ceremony, including a laudatio on the awardees and a short presentation of the awarded papers.

Psychometric challenges in modeling scientific problem-solving competency: An item response theory approach
by Ronny Scherer
University of Oslo, Faculty of Educational Sciences, Centre for Educational Measurement (CEMO), P.O. Box 1161 Blindern, 0318 Oslo, Norway
[email protected]

Impact of Frame Size and Instrumentation on Chroma-based Automatic Chord Recognition
by Daniel Stoller, Matthias Mauch, Igor Vatolkin, and Claus Weihs
TU Dortmund, Chair of Algorithm Engineering, {daniel.stoller;igor.vatolkin}@tu-dortmund.de
Queen Mary University of London, Centre for Digital Music, [email protected]
TU Dortmund, Chair of Computational Statistics, [email protected]

Conference Location

ECDA 2014 is hosted by Jacobs University Bremen, a highly selective, international, residential university situated in the Free Hanseatic City of Bremen. Bremen is an attractive city with UNESCO World Heritage Sites, located in the North of Germany. Its airport and the Central Train Station provide easy access to and from all major cities in Europe. The 80-acre park-like Jacobs University Bremen campus is situated in the Northern part of Bremen (Bremen-Grohn) and can be reached in about 20 minutes by regional train or car from the Central Train Station.

Address: Campus Ring 1 28759 Bremen Germany

Fig. 1. Location of Jacobs University in Bremen-Grohn

How to get to Jacobs University

By car

Satellite navigation system: Campus Ring 1, 28759 Bremen (or Bruno-Bürgel-Straße 27 or 38, 28759 Bremen)

Travelling from the South:
• Autobahn 27, Exit Bremen-Nord (Burglesum)
• At the intersection turn left onto the A 270, direction Elsfleth/HB-Blumenthal
• Leave the Autobahn at exit Bremen-St. Magnus/Grohn/Jacobs University
• At the end of the exit, turn left in direction U 5, Grohn, Jacobs University
• At the next traffic light turn left into Schönebecker Straße
• After the bridge, stay on Bruno-Bürgel-Straße
• After approx. 300 meters you will find the entrance to Jacobs University on the left

Travelling from the North:

• Autobahn 27 direction Bremen, Exit Ihlpohl
• Follow the A 270, direction Bremen-Vegesack
• Leave the Autobahn at exit Bremen-St. Magnus/Grohn/Jacobs University
• At the end of the exit, turn left in direction U 5, Grohn, Jacobs University
• At the next traffic light turn left into Schönebecker Straße
• After the bridge, stay on Bruno-Bürgel-Straße
• After approx. 300 meters you will find the entrance to Jacobs University on the left

Travelling from the airport:
• Take the B 6 from Bremen Airport towards Bremen City Center
• Follow Autobahn 27, direction Vegesack
• Leave the A 27 at exit 16 (Bremen-Nord)
• Follow the A 270, direction Bremen-Vegesack
• Leave the Autobahn at exit Bremen-St. Magnus/Grohn/Jacobs University
• At the end of the exit, turn left in direction U 5, Grohn, Jacobs University
• At the next traffic light turn left into Schönebecker Straße
• After the bridge, stay on Bruno-Bürgel-Straße
• After approx. 300 meters you will find the entrance to Jacobs University on the left

By rail

• Trains from Bremen’s main station (Hauptbahnhof) to Bremen-Nord leave every 30 minutes – at peak times every 15 minutes (mostly from platforms 5 or 6).
• Take the train towards Bremen-Vegesack or Bremen-Farge.
• Get off at Bremen-Schönebeck station (6th stop).
• After exiting the train, turn right and go up the stairs at the end of the platform.

• At the end of the stairs turn right again and follow Schönebecker Straße until it bends into Bruno-Bürgel-Straße.
• Follow Bruno-Bürgel-Straße until you see the main entrance of Jacobs University on the left side.

From Schönebeck station it is a five-minute walk down Schönebecker Straße and Bruno-Bürgel-Straße.

Fig. 2. How to get from Bremen-Schönebeck station to Jacobs University

By airplane

• Tram no. 6 leaves directly in front of the Bremen Airport Main Terminal.
• Take tram no. 6 (direction Universität).
• Get off at Bremen’s main station (Hauptbahnhof).
• Follow the steps above for the rail journey.

Train timetable

Personal timetable (valid July 1–5, 2014): Bremen Hbf to Bremen-Schönebeck

NWB RS1 trains depart Bremen Hbf at :04 and :34 past each hour, with additional departures at :19 and :49 during peak times; the journey to Bremen-Schönebeck takes 18 minutes. In the early-morning hours the route is covered by night bus N7 (journey time 53 minutes).

Fig. 3. Train timetable, Bremen main station to Bremen-Schönebeck

Bremen Hbf to Bremen-Schönebeck (continued) and Bremen-Schönebeck to Bremen Hbf

Departures from Bremen Hbf continue in the same :04/:34 pattern (peak also :19/:49) until 23:34. Return trains from Bremen-Schönebeck to Bremen Hbf leave at :07 and :37 past each hour (journey time 17 minutes).

Fig. 4. Train timetable, Bremen main station to Bremen-Schönebeck (continued)

Train timetable

Bremen-Schönebeck to Bremen Hbf (continued)

Return trains continue at :07 and :37 past each hour, with additional peak departures at :22 and :52, until 23:37; the journey takes 17 minutes.

Fig. 5. Train timetable, Bremen-Schönebeck to Bremen main station

Train timetable

Index and legend: the service codes indicate days of operation (pa = July 1–5; pg, ph = July 1–4; pd, ps = July 5; pj = July 4); K2 = 2nd class only. All information subject to change (HAFAS timetable data as of June 20, 2014).

Fig. 6. Train timetable, Bremen-Schönebeck to Bremen main station (index and legend)

Campus Map

Fig. 7. Campus map Jacobs University

Conference Rooms

The Opening Ceremony, the plenary lectures, and part of the semi-plenary talks take place in the Conference Hall in the Campus Center/IRC. The second track of the semi-plenary talks will be presented in the Conrad Naber Lecture Hall, which is situated in Reimar Lüst Hall. All contributed and invited sessions will take place in West Hall. Rooms 1 to 4 in West Hall are on the ground floor; Rooms 5 to 8 are on the upper level. The LIS’2014 workshop takes place in the Seminar Room in the IRC (3rd floor, East Wing, pass through the Group Study Area).

Rehearsal Room

West Hall 7 is available as rehearsal and preparation room.

WLAN

eduroam is available on campus. There is a second wireless network – Jacobs – which requires authentication.

Taxi

Funkwagen Susi, phone: 0421 62 11 11 (black cabs with plates starting OHZ-S ...)

On-campus accommodation

ECDA 2014 guests who have booked on-campus accommodation will mostly stay in College Nordmetall. Accommodation includes breakfast and dinner. Opening hours of the servery in College Nordmetall:

Mon - Fri Breakfast: 08:00 am - 09:30 am Dinner: 06:00 pm - 07:00 pm

Saturday Breakfast: 08:30 am - 10:00 am Dinner: 06:00 pm - 07:00 pm

Sunday and holidays Brunch: 10:30 am - 01:00 pm Dinner: 06:00 pm - 07:00 pm

In the foyer of College Nordmetall you can find a vending machine with soft drinks and snacks. Please note that this vending machine can only be operated with cash.

Restaurants in Bremen-Vegesack

• Strandlust, 0421 66090, www.strandlust.de (Hotel/Restaurant/Beer garden)
• Godenwind, 0421 652575, www.goden-wind.de (Traditional German food/Fish)
• Havenhaus, 0421 664093, www.hotel-havenhaus.de (Hotel/Traditional German)
• Nielebock's Fisch und Mee(h)r, 0421 89814315, www.nielebocksfischundmeehr.de (Fish)
• Nautico, 0421 69 05 999, http://nautico-bremen.de (Mediterranean)
• Salento Classico, 0421 8357900, http://www.salento-classico.de (Italian)
• Esszimmer, 0421 651828 (International)
• Shelale, 0421 2477514, http://www.restaurant-shelale.de/ (Turkish/Anatolian)
• Zur gläsernen Werft, 0421 6989489, www.zur-glaesernen-werft.de
• Selma die Kuh, 0421 6369365, http://www.selma-die-kuh.de (Bavarian)
• Die Yacht, 0421 3016828, http://www.die-yacht-bremen.de/ (International)
• Tinto, 0421 6608885, http://www.tinto-bremen.de (Spanish/Tapas/Wine)
• El Presidente, 0421 20813260, http://www.elpresidente-bremen.de (Mexican/Cocktails)
• Cordoba, 0421 662053, http://www.bistrocordoba.de/ (German/Italian/Cocktails)

Social Program

Pre-Conference Welcome Reception, July 1, 2014, 20:00 - 22:00

The pre-conference welcome reception is open to all participants and is included in the conference fee. If weather permits we will mingle on the terrace in front of the campus center. Otherwise, you are welcome to join the crowd in the university club.

City Tour Bremen, July 3, 2014, 16:30

We will depart by bus to the city center of Bremen. You will enjoy walking around the city for about an hour, discovering the most beautiful corners and attractions of the city center, including the picturesque old town hall as well as the Roland statue, which are jointly classified as a UNESCO World Heritage site. Certainly, you will also meet the Bremen Town Musicians.

Boat trip on River Weser, July 3, 2014, 18:30

The city tour will end at the Martini-Anleger on the Weser, the river running through Bremen and connecting the town to the North Sea. There we will board the Gräfin Emma.

Conference Dinner, July 3, 2014; 20:00

Our boat trip will end in Bremen-Vegesack, right in front of the Strandlust, where we are going to enjoy the conference dinner. Please note that the conference dinner is only included in the conference fees of GfKl’s and affiliated societies’ members and regular participants. All others can buy a ticket for the conference dinner separately. A voucher will be handed over to you upon registration on the first day of the conference. You will be asked to show this voucher before joining the event.

Part I

Scientific Program

Session Overview

Tuesday, 01/Jul/2014
• 6:00pm – 8:00pm: Meeting 1: GfKl Board Meeting (Conference Room)
• 8:00pm – 10:30pm: Welcome: Pre-Conference Welcome Reception (Club/University Terrace)

Wednesday, 02/Jul/2014
• Opening: Opening Ceremony and Plenary 1 (Conference Hall)
• SP-1: Semi-Plenary 1 (Conference Hall); SP-2: Semi-Plenary 2 (Conrad Naber Lecture Hall), Chair: Hans Kestler
• INV-1: Invited Session: High dimensional data: Using all dimensions (West Hall 1), Chair: Christian Hennig
• LIS-2014-Wed: LIS-2014 Workshop (IRC Seminar Room)
• Contributed sessions: CON-1A: Machine Learning and Knowledge Discovery I (West Hall 2), Chair: Marlene Müller; CON-1B: Data Analysis in Finance I (West Hall 3), Chair: Karsten Lübke; CON-1C: Statistics and Data Analysis I (West Hall 4), Chair: Hans-Joachim Mucha; CON-1D: Data Analysis in Interdisciplinary Domains (Musicology I) (West Hall 5), Chair: Claus Weihs; CON-1E: Data Analysis in Marketing I (West Hall 6), Chair: Winfried J. Steiner; CON-1F: Machine Learning and Knowledge Discovery II (West Hall 8), Chair: Eyke Hüllermeier
• Plenary 2: Panel Discussion: The Future of Publications in Classification and Data Sciences (Conference Hall), with Maurizio Vichi, Claus Weihs, Eyke Hüllermeier, Andreas Geyer-Schulz
• Meeting 2: GfKl Member Meeting (Conference Hall); Meeting: AG-DANK (West Hall 1)
• Coffee breaks and lunch (Campus Center Foyer / Campus Center East Wing)

Thursday, 03/Jul/2014
• Plenary 3 (Conference Hall), Chair: Berthold Lausen
• Semi-plenaries SP-3 to SP-6 (Conference Hall / Conrad Naber Lecture Hall)
• INV-2: Invited Session: Recent Advances in Mixture Modelling, Chair: Roberto Rocci
• INV-3: Invited Session: Data analysis in human-computer interaction scenarios (West Hall 5)
• INV-4: Invited Session: Health technology assessment of community interventions for active healthy aging (West Hall 1), Chairs: Helmut Hildebrandt, Berthold Lausen
• LIS-2014-Thur: LIS-2014 Workshop (IRC Seminar Room)
• Contributed sessions CON-2A to CON-2D and CON-3A to CON-3F (West Hall), including Machine Learning and Knowledge Discovery III/IV, Statistics and Data Analysis II–IV, Data Analysis in the Social Sciences I/II, Data Analysis in Marketing II, Big Data Analytics, and Clustering I
• 4:30pm: City Tour (departure from Jacobs University Bremen, Main Gate)
• 6:30pm – 8:00pm: Boat Trip on River Weser
• 8:00pm: Conference Dinner (Strandlust, Bremen-Vegesack)

Friday, 04/Jul/2014
• Semi-plenaries SP-7 and SP-8 (Conference Hall / Conrad Naber Lecture Hall)
• INV-5: Invited Session: SVM Large Scale Learning (West Hall 1)
• INV-6: Invited Session: Predictions with classification models (West Hall 1)
• Contributed sessions CON-4A to CON-4E and CON-5A to CON-5E (West Hall), including Machine Learning and Knowledge Discovery V/VI, Visualization and Graph Models, Statistics and Data Analysis V, Data Analysis in the Social Sciences III/IV, Data Analysis in Finance II, Data Analysis in Interdisciplinary Domains (Musicology II), and Clustering II
• Farewell: Farewell Reception (Campus Center Foyer)

Session Details

Tuesday, 01/Jul/2014
• 6:00pm – 8:00pm: Meeting 1: GfKl Board Meeting (Conference Room)
• 8:00pm – 10:30pm: Welcome: Pre-Conference Welcome Reception (Club/University Terrace)

Date: Wednesday, 02/Jul/2014

9:30am - 10:45am   Opening: Opening Ceremony
                   Location: Conference Hall; Chair: Adalbert F.X. Wilhelm

10:45am - 11:30am  Plenary 1: Plenary 1
                   Location: Conference Hall; Chair: Myra Spiliopoulou
- Do It Yourself: Exploratory Analysis of Very Large Scientific Data. Themis Palpanas (Paris Descartes University, France)

11:30am - 12:00pm  Coffee Break 1 (Campus Center Foyer)

12:00pm - 12:45pm  SP-1: Semi-Plenary 1
                   Location: Conference Hall; Chair: Daniel Baier
- Latent Variables and Marketing Theory - the Paradigm Shift. Adam Sagan

12:45pm - 1:15pm   AG-DANK: AG-DANK Meeting (Conference Room)

12:45pm - 2:00pm   Lunch 1 (Campus Center East Wing)

2:00pm - 4:05pm    Parallel sessions:

SP-2: Semi-Plenary 2
Location: Conrad Naber Lecture Hall; Chair: Hans Kestler
- Solving the Identity Crisis: Large-Scale Clustering with Distributions of Distances and Applications in Record Linkage. Rebecca Nugent (Carnegie Mellon University, Pittsburgh, PA, USA)

INV-1: Invited Session: High dimensional data: Using all dimensions
Location: West Hall 1; Chair: Christian Hennig
- Constant Storage and Potentially Linear Time Hierarchical Clustering Using the Baire Metric and Random Spanning Paths. Fionn Murtagh
- Clustering for high-dimension, low-sample-size data using distance vectors. Yoshikazu Terada (Center for Information and Neural Networks, National Institute of Information and Communications Technology, Japan)
- A pattern-clustering method for longitudinal data - heroin users receiving methadone. Chien-Ju Lin
- Quantile-based classifiers. Christian Hennig, Cinzia Viroli (UCL, United Kingdom; University of Bologna, Italy)

CON-1A: Machine Learning and Knowledge Discovery I
Location: West Hall 2; Chair: Marlene Müller
- On the Influence of Missing Data on Decision Tree Induction Methods. Kristof Szillat, Dieter William Joenssen (TU Ilmenau, Germany)
- Bagging Heterogeneous Decision Trees. Fabian Wolff, Dieter William Joenssen (TU Ilmenau, Germany)
- An Ensemble of Optimal Trees for Class Membership Probability Estimation. Zardad Khan, Asma Gul, Osama Mahmoud, Miftahuddin Miftahuddin, Aris Perperoglou, Werner Adler, Berthold Lausen (University of Essex, United Kingdom; University of Erlangen-Nuremberg)
- Comparison and Statistical Evaluation of Similarity Measures on a Concepts Lattice. Florent Domenach, George Portides (University of Nicosia, Cyprus)
- Textual information localization in document images based on quadtree decomposition. Cynthia Pitou, Jean Diatta (University of Reunion Island, France)

CON-1B: Data Analysis in Finance I
Location: West Hall 3; Chair: Karsten Lübke
- Assessing systemic risk with the Dynamic Conditional Beta approach. Katarzyna Kuziak (Wroclaw University of Economics, Poland)
- Power of skewness tests in the presence of fat tailed financial distributions. Krzysztof Piontek (Wroclaw University of Economics, Poland)
- A Hidden Markov Model to detect relevance in financial documents based on on/off topics. Dimitrios Kampas, Christoph Schommer, Ulrich Sorger (University of Luxembourg, Luxembourg)
- Born print, reborn digital - the Hoppenstedt Data Archive. Irene Schumm, Sebastian Weindel, Philipp Zumstein (UB Mannheim, Germany)
- Experimental design of VaR evaluation independence tests. Marta Małecka (University of Łódź, Poland)

CON-1C: Statistics and Data Analysis I
Location: West Hall 4; Chair: Hans-Joachim Mucha
- Prediction of UES Restitution Time in Aberrant Swallows. Nicolas Schilling, Andre Busche, Simone Miller, Michael Jungheim, Martin Ptok, Lars Schmidt-Thieme (University of Hildesheim, Germany; Medical High School Hannover, Germany)
- Incremental Generalized Canonical Correlation Analysis. Angelos Markos, Alfonso Iodice D'Enza (Democritus University of Thrace, Greece; University of Cassino and Southern Lazio, Italy)
- Exploiting longitudinal epidemiological data in similarity-based classification. Tommy Hielscher, Myra Spiliopoulou, Henry Völzke, Jens-Peter Kühn (Otto-von-Guericke University Magdeburg, Germany; University Medicine Greifswald, Germany)

CON-1D: Data Analysis in Interdisciplinary Domains (Musicology I)
Location: West Hall 5; Chair: Claus Weihs
- Fast Model Based Optimization of Tone Onset Detection by Instance Sampling. Nadja Bauer, Klaus Friedrichs, Claus Weihs (TU Dortmund, Germany)
- Recognition of leitmotives in Richard Wagner's music: chroma distance and listener expertise. Daniel Müllensiefen, David Baker, Tim Crawford, Christophe Rhodes, Laurence Dreyfus (Goldsmiths, University of London, United Kingdom; Oxford University, United Kingdom)
- The Surprising Character of Music. A Search for Sparsity in Music Evoked Body Movements. Denis Amelynck, Pieter-Jan Maes, Marc Leman, Jean-Pierre Martens (Ghent University, Belgium)
- An iterative learning approach to dataset demarcation in music analysis. Srikanth Cherla, Dan Tidhar, Daniel Wolff, Tillman Weyde (City University London, United Kingdom)

CON-1E: Data Analysis in Marketing I
Location: West Hall 6; Chair: Winfried J. Steiner
- Should finite mixture conjoint choice models account for utility interdependencies? Friederike Paetz, Winfried J. Steiner (Clausthal University of Technology, Germany)
- Quality evaluation of microeconometric models used in consumer preferences analysis. Tomasz Bartłomowicz, Andrzej Bąk (Wroclaw University of Economics, Poland)
- Wine consumer preference analysis with application of the conjoint R package. Aneta Rybicka, Marcin Pelka (Wroclaw University of Economics, Poland)
- Casting the Net: Category Spillover Effects in Crowdfunding Platforms. Thomas Müllerleile, Dieter William Joenssen (TU Ilmenau, Germany)

CON-1F: Machine Learning and Knowledge Discovery II
Location: West Hall 8; Chair: Eyke Hüllermeier
- Profit Oriented Causality Measure using Objective Utility and Domain Knowledge. Maria Otilia, Alejandro Molina, Sylvie Ratte (École de Technologie Supérieure; Escuela Superior Politécnica del Litoral)
- Risk Analysis of User Behavior in Online Communities Towards Churn. Alice Philippa Hiscock, Jonathan Forster, Athanassios Avramidis, Joerg Fliege (University of Southampton, United Kingdom)
- Mining Fuel-Inefficient Driving Behaviors From GPS Trajectories. Josif Grabocka, Lars Schmidt-Thieme (University of Hildesheim, Germany)
- A Signature Based Method for Fraud Detection on E-Commerce Scenarios. Gabriel Mota, Joana Fernandes, Orlando Belo (University of Minho, Portugal; Farfetch, Portugal)
- Rank Correlation Measures Based on Fuzzy Order Relations. Sascha Henzgen, Eyke Hüllermeier (University of Paderborn, Germany)
- Experimental design for the estimation of recovery of deep-water megafaunal assemblages from hydrocarbon drilling disturbance in the Faroe-Shetland Channel. Marwa Baeshen, Daniel Jones, Berthold Lausen (University of Essex, United Kingdom; National Oceanography Centre, University of Southampton, UK)

2:00pm - 5:30pm    LIS-2014-Wed: LIS-2014 Workshop, Wednesday, July 2
Location: IRC Seminar Room
- Collaborative Literature Work in the Scientific & Educational Publication Process: The Cogeneration of Citation Networks. Leon Burkard, Andreas Geyer-Schulz (Karlsruhe Institute of Technology, Institute of Information Systems and Marketing, Chair of Information Services and Electronic Markets)
- Formal Context Analysis: Context Pragmatics and Indexing in Knowledge Organization. Michael Kleineberg (Humboldt-Universität zu Berlin, Germany)
- PubSim: A graph based classification recommendation system for mathematical publications. Susanne Gottwald, Thorsten Koch (Konrad-Zuse-Zentrum, Germany)
- Subject indexing for author disambiguation - opportunities and challenges. Cornelia Hedeler, Andreas Oskar Kempf, Jan Steinberg (The University of Manchester, United Kingdom; GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany)

4:05pm - 4:35pm    Coffee Break 2 (Campus Center Foyer)

4:35pm - 5:25pm    Plenary 2: Panel Discussion: The Future of Publications in Classification and Data Sciences
                   Location: Conference Hall; Chair: Adalbert F.X. Wilhelm
                   Panel: Maurizio Vichi, Claus Weihs, Eyke Hüllermeier, Andreas Geyer-Schulz

5:30pm - 7:00pm    Meeting 2: GfKl Member Meeting (Conference Hall)

Date: Thursday, 03/Jul/2014

8:30am - 9:15am    Plenary 3: Plenary 3
                   Location: Conference Hall; Chair: Berthold Lausen
- Learning with Big Data by Incremental Optimization of Performance Measures. Zhi-Hua Zhou (Nanjing University, People's Republic of China)

9:15am - 10:00am   SP-3: Semi-Plenary 3
                   Location: Conference Hall; Chair: Eyke Hüllermeier
- The Speed-Interpretability-Accuracy Trade-Off in Decision Tree Learning. Roberta Siciliano (University of Naples Federico II, Italy)

9:15am - 10:00am   SP-4: Semi-Plenary 4
                   Location: Conrad Naber Lecture Hall; Chair: Michael Greenacre
- Visualization and Data Mining for High Dimensional Data. Alfred Inselberg (Tel Aviv University, Israel)

10:00am - 10:30am  Coffee Break 3 (Campus Center Foyer)

10:30am - 12:35pm  Parallel sessions:

INV-2: Invited Session: Recent Advances in Mixture Modelling
Location: West Hall 1; Chair: Roberto Rocci
- Clustering heteroscedastic data. Gunter Ritter
- Diagnostics for model-based clustering via mixture models with covariates. Salvatore Ingrassia, Antonio Punzo
- Mixture models for ordinal data: a pairwise likelihood approach. Monia Ranalli, Roberto Rocci (Sapienza University of Rome, Italy; University of Rome Tor Vergata, Italy)
- Multivariate Logistic Mixtures. Xiao Liu

CON-2A: Machine Learning and Knowledge Discovery III
Location: West Hall 2; Chair: Themis Palpanas
- Feature selection for kernel additive classifiers. Surette Bierman, Nelmarie Louw, Sarel Steel (Stellenbosch University, South Africa)
- Utilizing semantics for guiding multi-classifier systems. Ludwig Lausser, Florian Schmid, Johann Kraus, Axel Fürstberger, Hans A. Kestler (Ulm University, Germany)
- Characterizing feature selection algorithms. Lyn-Rouven Schirra, Ludwig Lausser, Hans A. Kestler (Ulm University, Germany)
- Network Data Integration for Biomarker Signature Discovery via Network Smoothed T-Statistics. Yupeng Cun, Holger Fröhlich (University of Bonn, Germany)
- Minimizing Redundancy among Genes Selected Based on the Overlapping Analysis. Osama Mahmoud, Andrew Harrison, Asma Gul, Zardad Khan, Aris Perperoglou, Metodi Metodiev, Berthold Lausen (University of Essex, United Kingdom)

CON-2B: Data Analysis in Social Sciences I
Location: West Hall 3; Chair: Karin Wolf-Ostermann
- The Effects of Parenthood on Well-Being. Evgenia Samoilova, Colin Vance (Jacobs University; BIGSSS; GESIS; RWI Essen)
- How health literacy facilitates healthy lifestyle habits: Analyses of data from an online study. Juliane Paech, Sonia Lippke (Jacobs University Bremen gGmbH, Germany)
- Validation of questionnaires using the pilot trial and the English Longitudinal Study of Ageing. Adi Florea, Louise Marsland, Samantha Gage, Terri Reynolds, Joanna Jackson, Berthold Lausen (University of Essex, United Kingdom; Colchester University Hospital, United Kingdom; School of Human Health Sciences, University of Essex, United Kingdom)

CON-2C: Statistics and Data Analysis II
Location: West Hall 4; Chair: Aris Perperoglou
- The analysis of incomplete multi-way tables with the use of log-linear models. Justyna Brzezińska (University of Economics in Katowice, Poland)
- The Weight of Penalty Optimization for Ridge Regression. Sri Utami Zuliana, Aris Perperoglou (University of Essex, United Kingdom)
- Correlated component regression: Profiling student performances by means of background characteristics. Bernhard Gschrey, Ali Ünlü (TUM School of Education, Technische Universität München, Germany)
- Fast DD-classification of functional data. Karl Mosler, Pavlo Mozharovskyi (University of Cologne, Germany)

INV-3: Invited Session: Data analysis in human-computer interaction scenarios
Location: West Hall 5; Chair: Friedhelm Schwenker
- A Study on the Impact of Additional Modalities on Automatic Emotion Recognition. Jonghwa Kim (Universität Augsburg, Germany)
- Analyzing and labeling multimodal data in human-computer interaction using ATLAS. Sascha Meudt, Friedhelm Schwenker (Ulm University, Germany)
- Automated Pain Recognition System on the Basis of Video Recording and Biopotentials. Steffen Walter, Sascha Gruss, Kerstin Limbrecht-Ecklundt, Junwen Tan, Harald C. Traue, Philipp Werner, Ayoub Al-Hamadi (Ulm University, Germany; University of Magdeburg, Germany)

CON-2D: Data Analysis in Social Sciences II
Location: West Hall 6; Chair: Andreas Geyer-Schulz
- Utilization of Panel Data Analysis to Predict the Risk of Poverty of EU Households. Maria Stachova, Lukas Sobisek (Faculty of Economics, Matej Bel University, Slovak Republic; Faculty of Informatics and Statistics, University of Economics, Prague)
- Applying the fuzzy set theory to identify the non-monetary factors of poverty. Klaudia Przybysz, Marta Dziechciarz-Duda (Wroclaw University of Economics, Poland)
- Ordered logistic model as a tool to identify the determinants of poverty risk among Polish households. Andrzej Wołoszyn (University School of Physical Education in Poznan, Poland)
- Student Life-Style Revisited - Values, Attitudes and Behavior. Andreas Geyer-Schulz, Thomas Hummel, Victoria-Anne Schweigert (KIT, Germany)
- MultiTrait-MultiMethod (MTMM) model in CFA and comparative analysis of 5, 7, 9 and 11 point scales. Piotr Tarka (Poznan University of Economics, Poland)

11:00am - 4:30pm   LIS-2014-Thur: LIS-2014 Workshop, Thursday, July 3
Location: IRC Seminar Room
- Classification literature // Online report 2013: A choice of interesting web-features for classifications; Bibliographic Report 2013: A choice of relevant decimal classifications. Bernd W.J. Lorenz (FHVR, Fachbereich AuB, München; Freie Universität Berlin; Gesellschaft für Klassifikation e.V., Deutschland)
- Classification systems in german public libraries: An overview about the status quo and its application. Frank Seeger
- Storing and Analyzing Bibliographic Metadata with ElasticSearch. Clemens Düpmeier (Karlsruhe Institute of Technology (KIT), Germany)
- Challenges in the Construction of a Discovery System - Subject Indexing of Textbooks. Bianca Pramann, Jessica Drechsler, Robert Strötgen, Esther Chen (Georg Eckert Institute for International Textbook Research, Member of the Leibniz Association, Germany)
- The survey of aboutness and ofness questions: improved indexing of social science data. Tanja Friedrich, Pascal Siegers (GESIS, Germany)

12:35pm - 1:45pm   Lunch 2 (Campus Center East Wing)

1:45pm - 2:30pm    SP-5: Semi-Plenary 5
                   Location: Conference Hall; Chair: Berthold Lausen
- How uncertainty simultaneously drives and hinders knowledge discovery from health data. Pedro Pereira Rodrigues (University of Porto, Portugal)

1:45pm - 2:30pm    SP-6: Semi-Plenary 6
                   Location: Conrad Naber Lecture Hall; Chair: Reinhold Decker
- On a Decision-Maker Without Preferences. Andreas Geyer-Schulz (Karlsruhe Institute of Technology, Germany)

2:35pm - 4:15pm    Parallel sessions:

INV-4: Invited Session: Health technology assessment of community interventions for active and healthy aging
Location: West Hall 1; Chairs: Iris Pigeot, Helmut Hildebrandt
- Business intelligence in the context of integrated care systems (ICS): experiences from the ICS "Gesundes Kinzigtal" in Germany. Alexander Pimperl (OptiMedis AG, Germany)
- Sample size considerations for community-based intervention studies. Benjamin Greve, Iris Pigeot (Leibniz Institute for Prevention Research and Epidemiology - BIPS GmbH, Germany; University of Bremen, Faculty of Mathematics and Computer Science, Germany)
- Intervention studies with and without anchoring vignettes. Stavros Poupakis, Hongsheng Dai, Adi Florea, Helge Gillmeister, Peter Lynn, Aris Perperoglou, Berthold Lausen (University of Essex, United Kingdom)
- Testing Lifestyle Theories with Different Data Analyses: Modelling Multiple Behavior Change in Behavioral and Health Care Sciences. Sonia Lippke (Jacobs University Bremen, Germany)

CON-3A: Machine Learning and Knowledge Discovery IV
Location: West Hall 2; Chair: Pedro Pereira Rodrigues
- Monitoring with the dynamic weighted majority method: an adaptive control chart based on real datasets with concept drift. Dhouha Mejri (Technische Universität Dortmund)
- Ensemble of k-Nearest Neighbour Classifiers for Class Membership Probability Estimation. Asma Gul, Zardad Khan, Osama Mahmoud, Miftahuddin Miftahuddin, Aris Perperoglou, Werner Adler, Berthold Lausen (University of Essex, United Kingdom; University of Erlangen-Nuremberg)
- Epistemic Uncertainty Sampling for Active Learning on Data Streams. Ammar Shaker, Eyke Hüllermeier (University of Paderborn, Germany)

CON-3B: Clustering I
Location: West Hall 3; Chair: Rebecca Nugent
- A Biclustering Model and a Method for Sparse Binary Data. Tadashi Imaizumi (Tama University, Japan)
- Geographic clustering through aggregation control. Ayale Daher (Université de Bretagne, France)
- Three-way clustering problems in regional science. Andrzej Sokołowski, Małgorzata Markowska, Danuta Strahl (Cracow University of Economics, Poland; Wroclaw University of Economics, Poland)
- Evaluating the Necessity of a Triadic Distance Model. Atsuho Nakayama (Tokyo Metropolitan University, Japan)

CON-3C: Statistics and Data Analysis III
Location: West Hall 4; Chair: Roberta Siciliano
- Employing the cluster analysis for the study of income tax law of Greece. Leonidas Tokou, Iannis Papadimitriou, Athanasios Vasakidis (University of Macedonia, Department of Applied Informatics, 156 Egnatia Street, GR-54006 Thessaloniki, Greece)
- A comparison of model-based and heuristic clustering methods for dietary pattern analysis. Claudia Börnhorst
- Multivariate functional regression analysis with application to classification problems. Tomasz Górecki, Waldemar Wołyński (Adam Mickiewicz University, Poland)
- Assessing the reliability of a multi-class classifier. Luca Frigau, Claudio Conversano, Francesco Mola (University of Cagliari, Italy)

CON-3D: Data Analysis in Marketing II
Location: West Hall 5; Chair: Wolfgang Gaul
- Evaluating Advertising Campaigns Using Image Data Analysis and Classification. Ines Frost, Daniel Baier (BTU Cottbus-Senftenberg, Germany)
- Accommodating Heterogeneity and Nonlinearity in Price Effects for Predicting Brand Sales and Profits. Winfried J. Steiner, Anett Wechselberger, Stefan Lang (Clausthal University of Technology, Germany; University of Innsbruck, Austria)
- Discrete Choice Models for Brand Price Trade-Off. Peter Kurz (TNS Infratest Deutschland GmbH, Germany)
- Lead User Classification for Data Analysis in Marketing. Alexander Sänn, Daniel Weber, Daniel Baier (BTU Cottbus-Senftenberg, Germany)
- Reification of subjective vehicle impressions - applying classification methods to predict the individually perceived quality from head-up-display images. Sonja Maria Köppl (TU Dortmund, Germany)

CON-3E: Big Data Analytics
Location: West Hall 6; Chair: Joaquin Vanschoren
- Missing Data Methods for Big Data Analysis. Dieter William Joenssen (Technische Universität Ilmenau, Germany)
- Big Data Oriented Symbolic Data Analysis in Cloud. Hiroyuki Minami (Hokkaido University, Japan)
- Big Data Analytics vs. Classical Data Science. Masahiro Mizuta (Hokkaido University, Japan)

CON-3F: Statistics and Data Analysis IV
Location: West Hall 7; Chair: Adam Sagan
- Specialization in Smart Growth Sectors vs. Effects of Workforce Number Changes in the European Union Regional Space. Elzbieta Sobczak (Wroclaw University of Economics, Poland)
- Comparison of working conditions in European countries with respect to gender, age and education. Zerrin Asan Greenacre, Michael Greenacre (Anadolu University, Turkey; Pompeu Fabra University, Spain)
- The identification of relations between smart growth and sensitivity to crisis in the European Union regions - panel data analysis. Beata Bal-Domańska (Wroclaw University of Economics, Poland)
- A model for comparison of government expenditure on civil servants - Gross wages and salaries in EU24. Joel Chiadikobi Nwaubani, Nikos Kapoulas (University of Macedonia, Greece; Faculty of Economics, Management and Tourism, Poland)

4:30pm             City Tour: Departure from Main Gate, Jacobs University Bremen
6:30pm             Boat Trip: Trip on River Weser
8:00pm             Conference Dinner: Strandlust, Bremen-Vegesack

Date: Friday, 04/Jul/2014

8:30am - 10:35am   Parallel sessions:

INV-5: Invited Session: Large Scale SVM Learning
Location: West Hall 1; Chair: Bernd Bischl
- Linear SVM Training with Online Adaptation of Coordinate Frequencies. Tobias Glasmachers
- Stochastic gradient algorithms for solving large-scale learning problems. Sangkyun Lee (TU Dortmund, Germany)
- A Comparative Study of Kernelized Support Vector Machines. Daniel Horn, Aydın Demircioglu, Bernd Bischl, Tobias Glasmachers, Claus Weihs
- Support Vector Machines for Active Learning. Jan Kremer, Steenstrup Pedersen, Christian Igel (University of Copenhagen, Denmark)

CON-4A: Machine Learning and Knowledge Discovery V
Location: West Hall 2; Chair: Myra Spiliopoulou
- Multi-label classification using multivariate linear regression. Surette Bierman, Sarel Johannes Steel (University of Stellenbosch, South Africa)
- Active Multi-Instance Multi-Label learning. Robert Retz, Friedhelm Schwenker (University of Ulm, Germany)
- Generation of Datasets for Label Ranking. Massimo Gurrieri, Xavier Siebert, Philippe Fortemps, Marc Pirlot, Nabil Ait Taleb, Yves Desmet (UMONS, Belgium; Université Libre de Bruxelles)
- Subset Correction for Multi-Label Classification. Robin Senge, Eyke Hüllermeier (University of Paderborn, Germany)
- Supervised Classification of Viral Genomes based on Restriction Site Distribution. Mohamed Amine Remita, Ahmed Halioui, Abdoulaye Banire Diallo (Bioinformatics Laboratory, University of Quebec at Montreal, Canada)

CON-4B: Visualization and Graph Models
Location: West Hall 3; Chair: Alfred Inselberg
- Uncertainty in Medical Data Analysis in the Case of Carotid Vessel Visualization. Gordan Ristovski, Tobias Preusser, Horst Hahn, Lars Linsen (Jacobs University, Bremen, Germany; Fraunhofer MEVIS, Bremen, Germany)
- Visual Analysis of Multi-run Spatio-temporal Simulation Data. Alexey Fofonov, Lars Linsen (Jacobs University Bremen, Germany)
- Size and shape effects in biplots. Michael Greenacre (Universitat Pompeu Fabra, Spain)
- Reviewing Graphical Modelling of Multivariate Temporal Processes. Matthias Eckardt (Humboldt-Universität zu Berlin, Germany)
- Learning Hierarchical Document Classifications from Recommender Graphs: An Application of Modularity Clustering. Fabian Ball, Andreas Geyer-Schulz

CON-4C: Statistics and Data Analysis V
Location: West Hall 4; Chair: Francesco Mola
- A comparison study for spectral, ensemble and spectral-mean shift clustering approaches for interval-valued symbolic data. Marcin Pelka (Wroclaw University of Economics, Poland)
- Clustering and solar radiance prediction. Henri Ralambondrainy, Jean-Daniel Lan-Sun-Luk, Jean-Pierre Chabriat, Yves Lechevallier (Université de la Réunion; INRIA Rocquencourt, France)
- Functional MDS and its Application to Monitoring Post Data in Fukushima Prefecture. Masahiro Mizuta, Hiroyuki Minami (Hokkaido University, Japan)
- Using annotated suffix tree similarity measure for text summarisation. Maxim Yakovlev, Ekaterina Chernyak (NRU HSE, Russian Federation)
- Using Hidden Markov Models to Improve Analyzing Accelerometer Data. Vitali Witowski, Ronja Foraita, Yannis Pitsiladis, Iris Pigeot, Norman Wirsik (Leibniz-Institut für Präventionsforschung und Epidemiologie - BIPS, Germany; University of Brighton, UK)

CON-4D: Data Analysis in Social Sciences III
Location: West Hall 5; Chair: Ali Ünlü
- Making Sense of Qualitative Data: An Application of the Gioia Method. Fabiola Heike Gerpott, Nale Lehmann-Willenbrock, Sven Constantin Voelpel (Jacobs University Bremen, Germany; VU University Amsterdam, Netherlands)
- Applying Multilevel Path Analysis: Analyzing the Role of Leaders' Work Engagement for Subordinates' Work Engagement. Daniela Gutermann, Nale Lehmann-Willenbrock, Sven C. Voelpel (Jacobs University Bremen, Germany)
- Mentoring in Context: An Application of Multilevel Mediation Models in Organizational Research. Doris Rosenauer, Astrid C. Homan, Annelies E. M. Van Vianen, Sven C. Voelpel (Jacobs University Bremen, Germany; VU University Amsterdam, Netherlands; Goethe University Frankfurt, Germany)
- Nonhierarchical Asymmetric Cluster Analysis of Relationships Among Managers at a Firm. Akinori Okada, Satoru Yokoyama (Tama University, Japan)
- Identification of digital skills profiles using ICT usage data. Dominik Antoni Rozkrut (University of Szczecin; Statistical Office in Szczecin, Poland)
- Biasing Effects of Non-representative Samples on the Quality of Recovery of Quasi-orders in IITA-type Item Hierarchy Mining. Ali Ünlü, Martin Schrepp (Technische Universität München, Germany; SAP AG, Germany)

CON-4E: Data Analysis in Finance II
Location: West Hall 6; Chair: Colin Vance
- Facilitating household financial plan optimization by adjusting time range of analysis to life-length risk aversion. Lukasz Feldman, Radoslaw Pietrzyk, Pawel Rokita (Wroclaw University of Economics, Poland)
- Constructing cumulated net cash flow scenarios with an underlying two-person survival process for household financial planning. Pawel Rokita, Radoslaw Pietrzyk (Wroclaw University of Economics, Poland)
- Firm-specific determinants on dividend changes: insights from data mining. Karsten Lübke (FOM Hochschule für Oekonomie & Management)
- Excess Takeover Premiums and Takeover Contests - An Analysis of Different Approaches for Determining Abnormal Offer Prices. Joachim Rojahn (FOM Hochschule für Oekonomie & Management, Germany; Deutsches Institut für Portfolio-Strategien (DIPS))
- Interval estimation of Value-at-Risk and Expected Shortfall for ARMA-GARCH models. Krzysztof Piontek (Wroclaw University of Economics, Poland)

10:35am - 11:00am  Coffee Break 4 (Campus Center Foyer)

11:00am - 12:40pm  Parallel sessions:

INV-6: Invited Session: Predictions with classification models
Location: West Hall 1; Chair: Jozef Pociecha
- Dynamic Aspects of Bankruptcy Prediction Models. Jozef Pociecha, Barbara Pawelek (Cracow University of Economics, Poland)
- Simple Random Sampling With Replacement as a Technique of Companies Selection in Corporate Bankruptcy Prediction. Mateusz Baryla, Barbara Pawelek (Cracow University of Economics, Poland)
- The use of hybrid predictive C&RT-logit models in analytical CRM. Mariusz Łapczyński (Cracow University of Economics, Poland)

CON-5A: Machine Learning and Knowledge Discovery VI
Location: West Hall 2; Chair: Friedhelm Schwenker
- Analysing Psychological Data by Evolving Computational Models. Peter Lane, Peter Sozou, Mark Addis, Fernand Gobet (University of Hertfordshire, United Kingdom; University of Liverpool, United Kingdom; Birmingham City University, United Kingdom)
- Information Theoretic Measures for Ant Colony Optimization. Gunnar Völkel, Markus Maucher, Christoph Müssel, Uwe Schöning, Hans A. Kestler (Ulm University, Germany)

CON-5B: Data Analysis in Interdisciplinary Domains
Location: West Hall 3; Chair: Christian Hennig
- A Bayesian approach to test the matching law with observational data. Johannes Zschache (University of Leipzig, Germany)
- Handling Missing Data in Non-Gaussian Hierarchical Bayesian Dynamic Models. Casper Albers (University of Groningen, The Netherlands)
- Bayesian analysis for mixtures of discrete distributions with a non-parametric component. Baba Bukar Alhaji, Hongsheng Dai, Yoshiko Hayashi, Berthold Lausen
- Exploring Unknown Terrain. Irmela Herzog (The Rhineland Regional Council (LVR), Germany)
- Modelling sea surface temperature in the Indian Ocean using gamboostLSS. Miftahuddin Miftahuddin, Benjamin Hofner, Andreas Mayr, Berthold Lausen (University of Essex, United Kingdom; University of Erlangen-Nuremberg)
- Estimating age- and height-dependent percentile curves for children using GAMLSS in the IDEFICS study. Timm Intemann, Hermann Pohlabeln, Diana Herrmann, Wolfgang Ahrens, Iris Pigeot (Leibniz Institute for Prevention Research and Epidemiology - BIPS GmbH, Germany)

CON-5C: Data Analysis in Social Sciences IV
Location: West Hall 4; Chair: Sonia Lippke
- Social Value Orientation and Expectations of Cooperation in Social Dilemmas: A Meta Analysis. Jan Luca Pletzer, Daniel Balliet, Sven C. Voelpel (Jacobs University Bremen, Germany; Vrije Universiteit Amsterdam)
- Using Meta-Analytic Structural Equation Modeling (MASEM) to test new models in organizational research: The example of transformational leadership effects on identification at work. Christiane A. L. Horstmeier, Diana Boer, Astrid C. Homan, Sven C. Voelpel (Jacobs University Bremen, Germany; Goethe University Frankfurt, Germany; University of Amsterdam, The Netherlands)
- Let's get Dynamic: Interaction Analysis. Fabiola Heike Gerpott, Nale Lehmann-Willenbrock (Jacobs University Bremen, Germany; VU Amsterdam, Netherlands)

CON-5D: Data Analysis in Interdisciplinary Domains (Musicology II)
Location: West Hall 5; Chair: Igor Vatolkin
- Duplicate detection in facsimile scans of early printed music. Christophe Rhodes, Tim Crawford, Mark d'Inverno (Goldsmiths, University of London, United Kingdom)
- Combining Audio Features and Playlist Statistics for Improved Music Category Recognition. Igor Vatolkin, Geoffray Bonnin, Dietmar Jannach (TU Dortmund, Germany)
- A Digital Music Lab - Framework for Analysing Big Music Data. Tillman Weyde, Stephen Cottrell, Jason Dykes, Simon Dixon, Mathieu Barthet, Emmanouil Benetos, Mark Plumbley, Nicolas Gold, Samer Abdallah, Daniel Wolff, Dan Tidhar, Mahendra Mahey (City University London, UK; Queen Mary University of London, UK; University College London, UK; The British Library, UK)
- Machine Learning for the Analysis of a Large Collection of Musical Scales. Srikanth Cherla, Dan Tidhar, Artur d'Avila Garcez, Tillman Weyde (City University London, United Kingdom)

CON-5E: Clustering II
Location: West Hall 6; Chair: Alfred Ultsch
- Supervised pretreatments are useful for supervised clustering. Oumaima Alaoui Ismaili, Vincent Lemaire, Antoine Cornuéjols (Orange Labs, France; AgroParisTech, France)
- Bottom-up Variable Selection in Cluster Analysis Using Bootstrapping: A Proposal. Hans-Joachim Mucha, Hans-Georg Bartel (WIAS Berlin, Germany; Humboldt University Berlin, Germany)
- Bayesian clustering of functional data in the presence of covariates. Damien Juery, Christophe Abraham, Bénédicte Fontez (Montpellier SupAgro, France)
- K-mode Clustering with Dimensional Reduction for Categorical Data. Kensuke Tanioka, Hiroshi Yadohisa (Doshisha University, Japan)

12:50pm - 1:35pm   SP-7: Semi-Plenary 7
                   Location: Conference Hall; Chair: Claus Weihs
- OpenML: Open science in machine learning. Joaquin Vanschoren (Eindhoven University of Technology, Belgium)

12:50pm - 1:35pm   SP-8: Semi-Plenary 8
                   Location: Conrad Naber Lecture Hall; Chair: Zhi-Hua Zhou
- Partially supervised learning algorithms in classification and clustering. Friedhelm Schwenker (Ulm University, Germany)

1:35pm - 3:00pm    Farewell: Farewell Reception (Campus Center East Wing)

Contents

Part I Scientific Program

Session Overview ...... XXIX

Session Details ...... XXXIV

Part II Keynotes

Do It Yourself: Exploratory Analysis of Very Large Scientific Data
Themis Palpanas ...... 3
Latent Variables and Marketing Theory – the Paradigm Shift
Adam Sagan ...... 4
Solving the Identity Crisis: Large-Scale Clustering with Distributions of Distances and Applications in Record Linkage
Rebecca Nugent ...... 5
Learning with Big Data by Incremental Optimization of Performance Measures
Zhi-Hua Zhou ...... 6
The Speed-Interpretability-Accuracy Trade-Off in Decision Tree Learning
Roberta Siciliano ...... 7
Visualization and Data Mining for High Dimensional Data
Alfred Inselberg ...... 8
How uncertainty simultaneously drives and hinders knowledge discovery from health data
Pedro Pereira Rodriguez ...... 9
On A Decision-Maker Without Preferences
Andreas Geyer-Schulz ...... 10
Partially supervised learning algorithms in classification and clustering
Friedhelm Schwenker ...... 11

Part III Invited Sessions

Invited Session 1: High-dimensional data: using all dimensions ...... 15
Linear Storage and Potentially Constant Time Hierarchical Clustering Using the Baire Metric and Random Spanning Paths
Fionn Murtagh, Pedro Contreras ...... 16
Clustering for high-dimension, low-sample-size data using distance vectors
Yoshikazu Terada ...... 17
A pattern-clustering method for longitudinal data - heroin users receiving methadone
Chien-Ju Lin, Christian Hennig ...... 18
Quantile-based classifiers
Christian Hennig, Cinzia Viroli ...... 19

Invited Session 2: Recent Advances in Mixture Modelling ...... 20
Clustering heteroscedastic data
Gunter Ritter ...... 21
Diagnostics for model-based clustering via mixture models with covariates
Salvatore Ingrassia, Antonio Punzo ...... 23
Mixture models for ordinal data: a pairwise likelihood approach
Monia Ranalli, Roberto Rocci ...... 24

Invited Session 3: Data analysis in human-computer interaction scenarios ...... 25
A Study on the Impact of Additional Modalities on Automatic Emotion Recognition
Jonghwa Kim ...... 26
Analyzing and labeling multimodal data in human-computer interaction using ATLAS
Sascha Meudt, Friedhelm Schwenker ...... 27
Automated Pain Recognition System on the Basis of Biopotentials and Video Recording
Steffen Walter, Sascha Gruss, Junwen Tan, Kerstin Limbrecht-Ecklundt, Harald C. Traue, Philipp Werner, Ayoub Al-Hamadi ...... 28

Invited Session 4: Health technology assessment of community interventions for active healthy aging ...... 29
Business intelligence in the context of integrated care systems (ICS): experiences from the ICS "Gesundes Kinzigtal" in Germany
Alexander Pimperl, Timo Schulte, Helmut Hildebrandt ...... 30
Sample size considerations for primary prevention studies with anchoring vignettes
Stavros Poupakis, Adi Florea, Hongsheng Dai, Helge Gillmeister, Peter Lynn, Aris Perperoglou, Berthold Lausen ...... 32
Testing Lifestyle Theories with Different Data Analyses: Modelling Multiple Behavior Change in Behavioral and Health Care Sciences
Sonia Lippke, Lena Fleig, Amelie U. Wiedemann, Ralf Schwarzer ...... 33

Invited Session 5: SVM Large Scale Learning ...... 34
Linear SVM Training with Online Adaptation of Coordinate Frequencies
Tobias Glasmachers ...... 35
Stochastic gradient algorithms for solving large-scale learning problems
Sangkyun Lee ...... 36
A Comparative Study of Kernelized Support Vector Machines
Daniel Horn, Aydın Demircioğlu, Bernd Bischl, Tobias Glasmachers, Claus Weihs ...... 37
Support Vector Machines for Active Learning
Jan Kremer, Kim Steenstrup Pedersen, Christian Igel ...... 38

Invited Session 6: Predictions with Classification Models ...... 39
Dynamic Aspects of Bankruptcy Prediction Models
Józef Pociecha, Barbara Pawełek, Mateusz Baryła ...... 40
Simple Random Sampling With Replacement as a Technique of Companies Selection in Corporate Bankruptcy Prediction
Mateusz Baryła, Barbara Pawełek, Józef Pociecha ...... 41
The use of hybrid predictive C&RT-logit models in analytical CRM
Mariusz Łapczyński ...... 42

Part IV Contributed Sessions

CON-1A: Machine Learning and Knowledge Discovery I ...... 45
On the Influence of Missing Data Methods on Decision Tree Induction
Kristof Szillat, Dieter William Joenssen ...... 46
Bagging Heterogeneous Decision Trees
Fabian Wolff, Dieter William Joenssen ...... 47
Comparison and Statistical Evaluation of Similarity Measures on Concepts in a Lattice
F. Domenach, G. Portides ...... 48
Quadtree decomposition for textual information localization in document images
Cynthia Pitou, Jean Diatta ...... 49
An Ensemble of Optimal Trees for Class Membership Probability Estimation
Zardad Khan, Asma Gul, Osama Mahmoud, Miftahuddin Miftahuddin, Aris Perperoglou, Werner Adler, Berthold Lausen ...... 50

CON-1B: Data Analysis in Finance I ...... 51
Assessing systemic risk with Dynamic Conditional Beta approach
Katarzyna Kuziak ...... 52
Power of skewness tests in the presence of fat tailed financial distributions
Krzysztof Piontek ...... 53
A Hidden Markov Model to detect relevance in financial documents based on on/off topics
Dimitrios Kampas, Christoph Schommer, Ulrich Sorger ...... 54
Born print, reborn digital - the Hoppenstedt Data Archive
Irene Schumm, Sebastian Weindel, Philipp Zumstein ...... 55
Experimental design in evaluation of VaR independence tests
Marta Małecka ...... 56

CON-1C: Statistics and Data Analysis I ...... 57
Prediction of Upper Esophageal Sphincter Restitution Time in Aberrant Swallows
Nicolas Schilling, Andre Busche, Simone Miller, Michael Jungheim, Martin Ptok, Lars Schmidt-Thieme ...... 58
Exploiting longitudinal epidemiological data in similarity-based classification
T. Hielscher, H. Völzke, J.-P. Kühn, M. Spiliopoulou ...... 59
Incremental Generalized Canonical Correlation Analysis
Angelos Markos, Alfonso Iodice D'Enza ...... 61
Weighted Rank Correlation Measures Based on Fuzzy Order Relations
Sascha Henzgen, Eyke Hüllermeier ...... 62

CON-1D: Data Analysis in Interdisciplinary Domains (Musicology I) ...... 63
Fast Model Based Optimization of Tone Onset Detection by Instance Sampling
Nadja Bauer, Klaus Friedrichs, Claus Weihs ...... 64
Recognition of leitmotives in Richard Wagner's music: chroma distance and listener expertise
Daniel Müllensiefen, David Baker, Christophe Rhodes, Tim Crawford, Laurence Dreyfus ...... 65
The Surprising Character of Music. A Search for Sparsity in Music Evoked Body Movements.
Denis Amelynck, Pieter-Jan Maes, Marc Leman, Jean-Pierre Martens ...... 66
An iterative learning approach to dataset demarcation in music analysis
Dan Tidhar, Srikanth Cherla, Daniel Wolff, Tillman Weyde ...... 68

CON-1E: Data Analysis in Marketing I ...... 69
Should finite mixture conjoint choice models account for utility interdependencies?
Friederike Paetz, Winfried J. Steiner ...... 70
Quality evaluation of microeconometric models used in consumer preferences analysis
Tomasz Bartłomowicz, Andrzej Bąk ...... 71
Wine consumer preference analysis with application of conjoint package of R
Aneta Rybicka, Marcin Pełka ...... 73
Casting the Net: Category Spillover Effects in Crowdfunding Platforms
Dieter William Joenssen, Thomas Müllerleile ...... 74

CON-1F: Machine Learning and Knowledge Discovery II ...... 75
Profit Measure using Objective Oriented Utility, Causality and Domain-Knowledge
Otilia Alejandro, Sylvie Ratté ...... 76
Risk Analysis of User Behavior in Online Communities Towards Churn
Philippa A. Hiscock, Jonathan J. Forster, Athanassios N. Avramidis, Jörg Fliege ...... 77
Mining Fuel-Inefficient Driving Behaviors From GPS Trajectories
Josif Grabocka, Lars Schmidt-Thieme ...... 78
A Signature Based Method for Fraud Detection on E-Commerce Scenarios
Orlando Belo, Gabriel Mota, Joana Fernandes ...... 79
Experimental design for estimation of the recovery of deep-water megafaunal assemblages from hydrocarbon drilling disturbance in the Faroe-Shetland Channel
Daniel Jones, Marwa Baeshen, Miftahuddin Miftahuddin, Stavros Poupakis, Berthold Lausen ...... 80

CON-2A: Machine Learning and Knowledge Discovery III ...... 81
Feature selection for additive kernel classifiers
Surette Bierman, Nelmarie Louw, Sarel Steel ...... 82
Utilizing semantics for guiding multi-classifier systems
Ludwig Lausser, Florian Schmid, Johann Kraus, Axel Fürstberger, Hans A. Kestler ...... 83
Characterizing feature selection algorithms
Lyn-Rouven Schirra, Ludwig Lausser, Hans A. Kestler ...... 84
Network and Data Integration for Biomarker Signature Discovery via Network Smoothed T-Statistics
Yupeng Cun, Holger Fröhlich ...... 85
Minimizing Redundancy among Genes Selected Based on the Overlapping Analysis
Osama Mahmoud, Andrew Harrison, Asma Gul, Zardad Khan, Metodi V. Metodiev, Berthold Lausen ...... 86

CON-2B: Data Analysis in Social Sciences I ...... 87
The Effects of Parenthood on Well-Being
Colin Vance, Evgenia Samoilova ...... 88
How health literacy facilitates healthy lifestyle habits: Analyses of data from an online study
Juliane Paech, Sonia Lippke ...... 89
Validation of questionnaires using a pilot trial and the English Longitudinal Study of Ageing
Adi Florea, Faith Gage, Samantha Head, Terri Reynolds, Louise Marsland, Joanna Jackson, Berthold Lausen ...... 90
Student Life-Style Revisited - Values, Attitudes and Behavior
Andreas Geyer-Schulz, Thomas Hummel, Victoria-Anne Schweigert ...... 91
MultiTrait-MultiMethod (MTMM) and CFA model in comparative analysis of 5, 7, 9 and 11 point scales
Piotr Tarka ...... 92

CON-2C: Statistics and Data Analysis II ...... 93
The analysis of incomplete multi-way tables with the use of log-linear models
Justyna Brzezińska ...... 94
The Weight of Penalty Optimization for Ridge Regression
Sri Utami Zuliana, Aris Perperoglou ...... 95
Correlated component regression: Profiling student performances by means of background characteristics
Bernhard Gschrey, Ali Ünlü ...... 96
Data Envelopment Analysis for City Efficiency
Daniel Reißmann, Iris Lehmann, Jörg Hennersdorf, Clemens Deilmann, Martin Behnisch ...... 97
Optimization of a Simulation for Inhomogeneous Mineral Subsoil Machining
Swetlana Herbrandt, Claus Weihs, Manuel Ferreira, Christian Rautert ...... 99

CON-2D: Data Analysis in Social Sciences II ...... 100
Utilization of Panel Data Analysis to Predict the Risk of Poverty of EU Households
Mária Stachová, Lukáš Sobíšek ...... 101
Applying the Fuzzy Set Theory to Identify the Non-monetary Factors of Poverty
Marta Dziechciarz-Duda, Klaudia Przybysz ...... 102
Ordered logistic model as a tool to identify the determinants of poverty risk among Polish households
Andrzej Wołoszyn, Izabela Kurzawa, Romana Głowicka-Wołoszyn ...... 103
Multivariate Logistic Mixtures
Xiao Liu, Ali Ünlü ...... 105
Fast DD-classification of functional data
Karl Mosler, Pavlo Mozharovskyi ...... 106

CON-3A: Machine Learning and Knowledge Discovery IV ...... 107
Monitoring dynamic weighted majority method with Adaptive control chart based on real datasets with concept drift
Dhouha Mejri, Mohamed Limam, Claus Weihs ...... 108
Ensemble of k-Nearest Neighbour Classifiers for Class Membership Probability Estimation
Asma Gul, Zardad Khan, Osama Mahmoud, Miftahuddin, Werner Adler, Aris Perperoglou, Berthold Lausen ...... 109
Multivariate functional regression analysis with application to classification problems
Tomasz Górecki, Waldemar Wołyński ...... 110
Assessing the reliability of a multi-class classifier
Luca Frigau, Claudio Conversano, Francesco Mola ...... 111

CON-3B: Clustering I ...... 112
A Biclustering Model and Method for a Sparse Binary Data
Tadashi Imaizumi ...... 113
Geographic clustering through aggregation control
Daher Ayale ...... 114
Three-way clustering problems in regional science
Andrzej Sokołowski, Małgorzata Markowska, Danuta Strahl ...... 115
Evaluating the Necessity of a Triadic Distance Model
Atsuho Nakayama ...... 116

CON-3C: Statistics and Data Analysis III ...... 117
Employing cluster analysis for the study of income tax law of Greece
Leonidas Tokou, Iannis Papadimitriou, Athanasios Vazakidis ...... 118
A comparison of heuristic and model-based clustering methods for dietary pattern analysis
Claudia Börnhorst, Benjamin Greve, Iris Pigeot ...... 119
Reification of subjective vehicle impressions - Objectification of the individual perceived quality from head-up-display images
Sonja Maria Köppl ...... 120
Reification of subjective vehicle impressions - Apply classification methods to predict the perceived quality from head-up-display images
Sonja Maria Köppl ...... 121

CON-3D: Data Analysis in Marketing II ...... 122
Evaluating Advertising Campaigns Using Image Data Analysis and Classification
Daniel Baier, Sarah Frost, Ines Daniel ...... 123
Accommodating Heterogeneity and Nonlinearity in Price Effects for Predicting Brand Sales and Profits
Winfried J. Steiner, Stefan Lang, Anett Weber, Peter Wechselberger ...... 124
Adaptive Discrete Choice Models for Brand Price Trade-Off
Peter Kurz ...... 125
Lead User Classification for Data Analysis in Marketing
Alexander Sänn, Daniel Baier ...... 127

CON-3E: Big Data Analytics ...... 128
Missing Data Methods for Big Data Analysis
Dieter William Joenssen ...... 129
Big Data Oriented Symbolic Data Analysis in Cloud
Hiroyuki MINAMI, Masahiro MIZUTA ...... 130
Big Data Analytics vs. Classical Data Science
Claus Weihs ...... 131
Epistemic Uncertainty Sampling for Active Learning on Data Streams
Ammar Shaker, Eyke Hüllermeier ...... 132

CON-3F: Statistics and Data Analysis IV ...... 133
Specialization in Smart Growth Sectors vs. Effects of Workforce Number Changes in the European Union Regional Space
Elżbieta Sobczak, Marcin Pełka ...... 134
Comparison of working conditions in European countries with respect to gender, age and education
Zerrin Asan Greenacre, Michael Greenacre ...... 135
The Identification of Relations Between Smart Growth and Sensitivity to Crisis in the European Union Regions - Panel Data Analysis
Beata Bal-Domańska ...... 136
A model for comparing government expenditure on civil servants' compensation of gross wages and salaries in EU24
J.C. Nwaubani, N. Kapoulas ...... 137

CON-4A: Machine Learning and Knowledge Discovery V ...... 139
Multi-label classification using multivariate linear regression
Sarel Steel, Surette Bierman ...... 140
Active Multi-Instance Multi-Label learning
Robert Retz, Friedhelm Schwenker ...... 141
Generation of Datasets for Label Ranking
Massimo Gurrieri, Philippe Fortemps, Xavier Siebert, Marc Pirlot, Nabil Ait Taleb, Yves Desmet ...... 142
Subset Correction for Multi-Label Classification
Robin Senge, Eyke Hüllermeier ...... 143
Supervised Classification of Viral Genomes based on Restriction Site Distribution
Mohamed Amine Remita, Ahmed Halioui, Abdoulaye Baniré Diallo ...... 144

CON-4B: Visualization and Graph Models ...... 145
Uncertainty in Medical Data Analysis in the Case of Carotid Vessel Visualization
Gordan Ristovski, Tobias Preusser, Horst Hahn, Lars Linsen ...... 146
Visual Analysis of Multi-run Spatio-temporal Simulation Data
Alexey Fofonov, Lars Linsen ...... 148
Size and shape effects in biplots
Michael Greenacre ...... 149
Reviewing Graphical Modelling of Multivariate Temporal Processes
Matthias Eckardt ...... 150
Learning Hierarchical Document Classifications from Recommender Graphs: An Application of Modularity Clustering
Fabian Ball, Andreas Geyer-Schulz ...... 151

CON-4C: Statistics and Data Analysis V ...... 152
A comparison study for spectral, ensemble and spectral-mean shift clustering approaches for interval-valued symbolic data
Marcin Pełka ...... 153
Clustering and solar radiance prediction
Henri Ralambondrainy, Yves Lechevallier, Jean-Daniel Lan-Sun-Luk, J.P. Chabriat ...... 154
Moving Functional MDS and its Application to Monitoring Post Data in Fukushima Prefecture
Masahiro MIZUTA, Hiroyuki MINAMI ...... 155
Using annotated suffix tree similarity measure for text summarisation
Maxim Yakovlev, Ekaterina Chernyak ...... 156
Using Hidden Markov Models to Improve Analyzing Accelerometer Data
Norman Wirsik, Vitali Witowski, Ronja Foraita, Yannis Pitsiladis, Iris Pigeot ...... 157

CON-4D: Data Analysis in Social Sciences III ...... 158
Making Sense of Qualitative Data: An Application of the Gioia Method
Fabiola H. Gerpott, Sven C. Voelpel ...... 159
Applying Multilevel Path Analysis: Analyzing the Role of Leaders' Work Engagement for Subordinates' Work Engagement
Daniela Gutermann, Nale Lehmann-Willenbrock, Diana Boer, Sven C. Voelpel ...... 160
Mentoring in Context: An Application of Multilevel Mediation Models in Organizational Research
Doris Rosenauer, Annelies E. M. Van Vianen, Astrid C. Homan, Christiane A. L. Horstmeier, Sven C. Voelpel ...... 161
Nonhierarchical Asymmetric Cluster Analysis of Relationships Among Managers at a Firm
Akinori Okada, Satoru Yokoyama ...... 162
Identification of digital skills profiles using ICT usage data
Dominik Antoni Rozkrut ...... 163

CON-4E: Data Analysis in Finance II ...... 164
Facilitating household financial plan optimization by adjusting time range of analysis to life-length risk aversion
Lukasz Feldman, Radoslaw Pietrzyk, Pawel Rokita ...... 165
Constructing cumulated net cash flow scenarios with an underlying two-person survival process for household financial planning
Pawel Rokita ...... 166
Firm-specific determinants on dividend changes: insights from data mining
Karsten Luebke, Joachim Rojahn ...... 167
Excess Takeover Premiums and Takeover Contests - An Analysis of Different Approaches for Determining Abnormal Offer Prices
Wolfgang Bessler, Colin Schneck ...... 168
Interval estimation of Value-at-Risk and Expected Shortfall for ARMA-GARCH models
Krzysztof Piontek ...... 169

CON-5A: Machine Learning and Knowledge Discovery VI ...... 171
Analysing Psychological Data by Evolving Computational Models
Peter C. R. Lane, Peter D. Sozou, Fernand Gobet, Mark Addis ...... 172
Information Theoretic Measures for Ant Colony Optimization
Gunnar Völkel, Markus Maucher, Christoph Müssel, Uwe Schöning, Hans A. Kestler ...... 173
Modelling sea surface temperature in the Indian Ocean using gamboostLSS
Miftahuddin, Marwa Baeshen, Adi Florea, Stavros Poupakis, Benjamin Hofner, Andreas Mayr, Berthold Lausen ...... 174
Estimating age- and height-dependent percentile curves for children using GAMLSS in the IDEFICS study
Timm Intemann, Hermann Pohlabeln, Diana Herrmann, Wolfgang Ahrens, Iris Pigeot ...... 175

CON-5B: Data Analysis in Interdisciplinary Domains ...... 176
A Bayesian approach to test the matching law with observational data
Johannes Zschache ...... 177
Handling missing data in non-Gaussian hierarchical Bayesian Dynamic Models
Casper J. Albers ...... 178
Bayesian analysis for mixtures of discrete distributions with a non-parametric component
Baba B. Alhaji, Hongsheng Dai, Yoshiko Hayashi, Berthold Lausen ...... 179
Exploring Unknown Terrain
Irmela Herzog ...... 180

CON-5C: Data Analysis in Social Sciences IV ...... 181
Social Value Orientation and Expectations of Cooperation: A Meta-Analysis
Jan Luca Pletzer, Daniel Balliet, Sven C. Voelpel ...... 182
Using Meta-Analytic Structural Equation Modeling (MASEM) to Test New Models in Organizational Research: The Example of Transformational Leadership Effects on Identification at Work
Christiane A. L. Horstmeier, Diana Boer, Astrid C. Homan, Sven C. Voelpel ...... 183
Biasing Effects of Non-representative Samples of Quasi-orders in the Assessment of Recovery Quality of IITA-type Item Hierarchy Mining
Ali Ünlü, Martin Schrepp ...... 184
Let's Get Dynamic: Interaction Analysis
Fabiola H. Gerpott, Nale Lehmann-Willenbrock ...... 185

CON-5D: Data Analysis in Interdisciplinary Domains (Musicology II) ...... 186
Duplicate detection in facsimile scans of early printed music
Christophe Rhodes, Tim Crawford, Mark d'Inverno ...... 187
Combining audio features and playlist statistics for improved music category recognition
Igor Vatolkin, Geoffray Bonnin, Dietmar Jannach ...... 188
Digital Music Lab - A Framework for Analysing Big Music Data
Tillman Weyde, Stephen Cottrell, Emmanouil Benetos, Daniel Wolff, Dan Tidhar, Jason Dykes, Mark Plumbley, Simon Dixon, Mathieu Barthet, Nicolas Gold, Samer Abdallah, Mahendra Mahey ...... 189
Machine Learning for the Analysis of a Large Collection of Musical Scales
Srikanth Cherla, Dan Tidhar, Artur d'Avila Garcez, Tillman Weyde ...... 191

CON-5E: Clustering II ...... 192
Supervised pretreatments are useful for supervised clustering
Vincent Lemaire, Oumaima Alaoui Ismaili, Antoine Cornuéjols ...... 193
Bottom-up Variable Selection in Cluster Analysis Using Bootstrapping: A Proposal
Hans-Joachim Mucha, Hans-Georg Bartel ...... 194
Bayesian clustering of functional data in the presence of covariates
Damien Juery, Christophe Abraham, Bénédicte Fontez ...... 196
K-mode Clustering with Dimensional Reduction for Categorical Data
Kensuke Tanioka, Hiroshi Yadohisa ...... 197

Part V LIS’2014 Workshop

Workshop on Classification and Subject Indexing in Library and Information Science (LIS'2014) ...... 201
Collaborative Literature Work in the Scientific & Educational Publication Process: The Cogeneration of Citation Networks
Leon Burkard, Andreas Geyer-Schulz ...... 202
Context Analysis and Context Indexing: Formal Pragmatics in Knowledge Organization
Michael Kleineberg ...... 203
PubSim: A graph based classification recommendation system for mathematical publications
Susanne Gottwald, Thorsten Koch ...... 205
Subject indexing for author disambiguation - opportunities and challenges
Cornelia Hedeler, Andreas Oskar Kempf, Jan Steinberg ...... 206
Bibliographic Report 2013: A choice of relevant decimal classification literature // Online report 2013: A choice of interesting web-features for classifications
Bernd Lorenz ...... 208
Online Report 2013: A choice of interesting web-features for classifications
Michael Franke ...... 210
Classification Systems in German public libraries. An overview about the status quo and its application
Frank Seeger ...... 211
Storing and Analyzing Bibliographic Metadata with ElasticSearch
Clemens Düpmeier ...... 212
Subject Indexing of Textbooks - Challenges in the Construction of a Discovery System
Esther Chen, Jessica Drechsler, Bianca Pramann, Robert Strötgen ...... 213
The ofness and aboutness of survey questions: improved indexing of social science data
Tanja Friedrich, Pascal Siegers ...... 214

Part II

Keynotes

Do It Yourself: Exploratory Analysis of Very Large Scientific Data

Themis Palpanas

Paris Descartes University, Paris, France [email protected]

Abstract. There is an increasingly pressing need, by several applications in diverse domains, for developing techniques able to index and mine very large collections of data series. Examples of such applications come from biology, astronomy, entomology, the web, and other domains. It is not unusual for these applications to involve numbers of data series in the order of hundreds of millions to billions, which are oftentimes not analyzed in their full detail due to their sheer size. In this talk, we describe recent efforts in designing techniques for indexing and mining truly massive collections of data series that will enable scientists to easily analyze their data. We show that the main bottleneck in mining such massive datasets is the time taken to build the index, and we thus introduce solutions to this problem. Furthermore, we discuss novel techniques that adaptively create data series indexes, allowing users to correctly answer queries before the indexing task is finished. Finally, we show how our methods allow mining on datasets that would otherwise be completely untenable, including the first published experiments using one billion data series.
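The summarization step such indexes rely on can be illustrated with a much-simplified sketch (not the speaker's actual method): Piecewise Aggregate Approximation (PAA), the building block of iSAX-style indexes, reduces each series to a few segment means, and the cheap sketch distance, which lower-bounds the Euclidean distance, is then used to pick an approximate nearest neighbour. All function names here are ours.

```python
import math

def paa(series, segments):
    # Piecewise Aggregate Approximation: the mean of each of
    # `segments` (roughly) equal-length pieces of the series.
    n = len(series)
    sketch = []
    for i in range(segments):
        lo, hi = i * n // segments, (i + 1) * n // segments
        sketch.append(sum(series[lo:hi]) / (hi - lo))
    return sketch

def sketch_dist(a, b, seg_len):
    # Euclidean-style distance between two PAA sketches; with the
    # seg_len weighting it lower-bounds the true Euclidean distance.
    return math.sqrt(seg_len * sum((x - y) ** 2 for x, y in zip(a, b)))

def approx_nn(query, collection, segments=8):
    # Approximate nearest neighbour: compare sketches, not raw series.
    q = paa(query, segments)
    seg_len = len(query) / segments
    return min(collection, key=lambda s: sketch_dist(paa(s, segments), q, seg_len))
```

A real index would store the sketches in a tree and only fall back to raw-series distances for the few candidates the sketch distance cannot prune.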

Latent Variables and Marketing Theory – the Paradigm Shift

Adam Sagan

Cracow University of Economics, Rakowicka 27, 31-510 Cracow, Poland [email protected]

Abstract. The extensive discussion of the formal, empirical and ontological status of latent variables in the psychological literature centres on the distinction between realist and anti-realist positions within the CTT and IRT psychometric traditions of latent variable measurement (Borsboom 2008, Salzberger and Koller 2013). However, this bi-polar view seems too distant from the actual schools of thought in the marketing discipline. In the paper, the interlinks between the variety of schools in marketing (functional, commodity, institutional, managerial, marketing systems, consumer behavior, macromarketing, exchange, interaction & network, relationship and SDL; Jones, Shaw and McLean 2010) and the conceptual and operational status of latent variables are discussed. The diversity of measurement models for latent variables in marketing theory involves at least three approaches: 1/ common factor and IRT models (latent traits), 2/ composites and formative latent variables and 3/ relational latent variables. Understanding the status of latent variables in each school of thought in marketing theory may improve both the use of latent variables inside the marketing discipline and the effectiveness of communication between scholars.

References

BORSBOOM, D. (2008): Latent Variable Theory. Measurement, 6, 25–53.
JONES, D. G. B., SHAW, E. H. and MCLEAN, P. A. (2010): The Modern Schools of Marketing Thought. In: P. Maclaran, M. Saren, B. Stern and M. Tadajewski (Eds.): The SAGE Handbook of Marketing Theory. SAGE, London, 42–58.
SALZBERGER, T. and KOLLER, M. (2013): Towards a New Paradigm of Measurement in Marketing. Journal of Business Research, 66, 1307–1317.

Keywords

LATENT VARIABLES, MEASUREMENT THEORY, MARKETING PARADIGMS

Solving the Identity Crisis: Large-Scale Clustering with Distributions of Distances and Applications in Record Linkage

Rebecca Nugent

Carnegie-Mellon University, Pittsburgh, PA, USA, [email protected]

Abstract. "Will the real Rebecca Nugent please stand up?" Deduplication is the process of linking records corresponding to unique entities within a single database, where each unique entity may be duplicated multiple times. Most often these records are largely text, but they can also contain continuous or categorical fields (e.g. last name vs. age). We frame deduplication as a clustering problem, where each observed record belongs to a cluster corresponding to some latent unique entity in the underlying population. In large-scale deduplication problems (e.g. millions of records), calculating and analyzing the necessary pairwise comparisons is computationally infeasible. Instead, we adopt a divide-and-conquer strategy and re-frame our problem as clustering records given different estimates of their similarity (or inter-record distance). More specifically, we define the possible distances between record-pairs to be a monotonically decreasing transformation of their pairwise matching probabilities, which are obtained via an ensemble of classifiers. Rather than aggregating the classifier results, e.g., by taking the mean distance, we instead cluster the records by mapping features of these distributions of distances to the best approximation of the true distance between records. In general, this approach can be used when we have large, computationally intractable datasets and/or unavailable or uncertain distances between observations. We find that this more flexible approach improves match classification and clustering performance for the more difficult records to link (e.g., the "coin flip" decisions). With respect to computational efficiency, we introduce a sequential blocking scheme that reduces the number of comparisons needed with negligible effect on performance.
We present results from the identification of unique inventors in the United States Patent and Trademark Office patent-inventor database and, if time permits, discuss our ongoing work with the Human Rights Data Analysis Group to identify the unique fatalities in the ongoing Syrian civil war.
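As a toy illustration of this framing (not the scalable method of the talk): map a pairwise matching probability through a monotonically decreasing transform to a distance, link record pairs whose distance falls below a threshold, and read off entity clusters as connected components. `match_prob` is a hypothetical stand-in for the ensemble of classifiers, and the all-pairs loop is exactly what blocking avoids at scale.

```python
import math
from itertools import combinations

def match_prob_to_dist(p, eps=1e-12):
    # A monotonically decreasing transform of a match probability:
    # high probability -> small distance.
    return -math.log(max(p, eps))

def dedupe(records, match_prob, threshold=0.5):
    # Link every pair whose match probability exceeds `threshold`
    # (distance below -log(threshold)); entities are the connected
    # components, found with a union-find structure.
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    max_dist = match_prob_to_dist(threshold)
    for i, j in combinations(range(len(records)), 2):
        if match_prob_to_dist(match_prob(records[i], records[j])) < max_dist:
            parent[find(i)] = find(j)
    clusters = {}
    for i, rec in enumerate(records):
        clusters.setdefault(find(i), []).append(rec)
    return list(clusters.values())
```

With e.g. a case-insensitive comparison as `match_prob`, "Rebecca Nugent" and "rebecca nugent" fall into one cluster while an unrelated record stays alone.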

Keywords

Clustering, deduplication, distance, latent entity, pairwise matching probabilities

Learning with Big Data by Incremental Optimization of Performance Measures

Zhi-Hua Zhou

National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China [email protected]

Abstract. A popular approach to achieve a strong learning system is to take the performance measure that will be used for evaluation as an optimization target, and then accomplish the learning task by an optimization procedure. Many performance measures in machine learning, however, are unfortunately non-linear, non-smooth and non-convex, leading to difficult optimization problems. With big data, the optimization becomes even more challenging because of the concerns of computational, storage, communication costs, etc. Particularly, it becomes almost impossible to collect all data at first and then perform optimization, and it is desired to be able to optimize performance measures incrementally, without accessing the whole data. In this talk we will introduce some recent studies along this direction.
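In the spirit of the one-pass AUC work cited below, though not a reproduction of it, a sketch: optimize the pairwise hinge surrogate of AUC with stochastic gradient steps, comparing each incoming example only against a small reservoir of the opposite class, so the data is touched exactly once. Function names and hyperparameters are illustrative.

```python
import random

def one_pass_auc_sgd(stream, dim, lr=0.1, buf_size=25, seed=0):
    # Single pass over (x, y) pairs, y in {0, 1}: take a gradient step
    # on the pairwise hinge surrogate of AUC against a reservoir sample
    # of the opposite class, then update that class's reservoir.
    rng = random.Random(seed)
    w = [0.0] * dim
    buffers = {0: [], 1: []}
    seen = {0: 0, 1: 0}
    for x, y in stream:
        for x_opp in buffers[1 - y]:
            x_pos, x_neg = (x, x_opp) if y == 1 else (x_opp, x)
            margin = sum(wi * (p - n) for wi, p, n in zip(w, x_pos, x_neg))
            if margin < 1.0:  # hinge active: push positives above negatives
                w = [wi + lr * (p - n) for wi, p, n in zip(w, x_pos, x_neg)]
        seen[y] += 1
        if len(buffers[y]) < buf_size:
            buffers[y].append(x)
        else:  # reservoir sampling keeps a uniform sample of the class
            k = rng.randrange(seen[y])
            if k < buf_size:
                buffers[y][k] = x
    return w

def auc(w, data):
    # Empirical AUC: fraction of positive/negative pairs ranked correctly.
    scores = [(sum(wi * xi for wi, xi in zip(w, x)), y) for x, y in data]
    pos = [s for s, y in scores if y == 1]
    neg = [s for s, y in scores if y == 0]
    return sum(1 for p in pos for n in neg if p > n) / (len(pos) * len(neg))
```

Memory is bounded by the two reservoirs rather than by the stream length, which is the point of the incremental formulation.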

References

GAO, W., JIN, R., ZHU, S. and ZHOU, Z.-H. (2013): One-pass AUC optimization. In: Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, 906–914.
LI, N., TSANG, I. W. and ZHOU, Z.-H. (2013): Efficient optimization of performance measures by classifier adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 1370–1382.

Keywords

INCREMENTAL OPTIMIZATION, PERFORMANCE MEASURE, ONE-PASS LEARNING, CLASSIFIER ADAPTATION

The Speed-Interpretability-Accuracy Trade-Off in Decision Tree Learning

Roberta Siciliano

Department of Industrial Engineering, University of Naples Federico II, Italy, [email protected]

Abstract. Decision Tree Learning considers a connected and oriented tree graph to describe a hierarchical data structure with a set of linked nodes, the end-nodes being called "leaves". Starting from the root node, a sample of objects is recursively partitioned into a fixed number of subsamples (internal nodes) so as to reduce the variation/heterogeneity of a numerical/categorical response variable on the basis of a set of predictors of any type. Leaves are labeled with the response average value/modal class, allowing for the interpretation of the dependence relationships through the tree paths. Ensemble methods such as random forests, boosting and bagging combine several tree structures to define an accurate decision rule that predicts the response value/class for a new object for which only the predictor measurements are known. Trees are widely used in supervised classification and non parametric regression for data mining and prediction in many fields of application. In the paper, we discuss the speed-interpretability-accuracy trade-off and we provide some methodological proposals to improve stability in tree growing and to reduce the computational cost while assuring accuracy and interpretability in decision rule production.
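The trade-off can be made concrete with a minimal CART-style grower (our own sketch, not the paper's proposal): `max_depth` is the knob. Depth 0 gives the fastest and most readable rule (a single modal-class leaf), and each extra level buys accuracy at the cost of build time and interpretability.

```python
def gini(labels):
    # Gini impurity of a list of class labels.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def grow(X, y, max_depth):
    # Recursively grow a CART-style classification tree; a node is
    # either a leaf (the modal class) or (feature, threshold, left, right).
    majority = max(set(y), key=y.count)
    if max_depth == 0 or gini(y) == 0.0:
        return majority
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            left = [i for i in range(len(X)) if X[i][j] <= t]
            right = [i for i in range(len(X)) if X[i][j] > t]
            if not left or not right:
                continue
            # size-weighted impurity of the two children
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return majority
    _, j, t, left, right = best
    return (j, t,
            grow([X[i] for i in left], [y[i] for i in left], max_depth - 1),
            grow([X[i] for i in right], [y[i] for i in right], max_depth - 1))

def predict(node, x):
    while isinstance(node, tuple):
        j, t, lo, hi = node
        node = lo if x[j] <= t else hi
    return node
```

On XOR-style data, for instance, depth 0 cannot do better than the majority class while depth 2 classifies perfectly, which is the accuracy side of the trade-off in miniature.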

Keywords accuracy, computational cost, decision trees, interpretability, supervised learning

Visualization and Data Mining for High Dimensional Data

Alfred Inselberg1

School of Mathematical Sciences, Tel Aviv University, www.math.tau.ac.il/~aiisreal; Senior Fellow, San Diego Supercomputing Center [email protected]

Abstract. A dataset with M items has 2^M subsets, any one of which may be the one satisfying our objectives. With a good data display and interactivity, our remarkable pattern-recognition ability can cut great swaths through this combinatorial explosion, unlocking surprising insights. That is the core reason for data visualization. With parallel coordinates, the search for relations in multivariate data is transformed into a 2-D pattern recognition problem. The knowledge discovery process is illustrated on several real multidimensional datasets. There is also a geometric classification algorithm with low computational complexity providing the classification rule explicitly and visually. The minimal set of variables required to state the rule (features) is found and ordered by predictive value. A complex system is modeled as a hypersurface, enabling interactive exploration of its functions, sensitivities, trade-offs, impact of constraints and more. An overview of the methodology provides foundational understanding: learning the patterns corresponding to various multivariate relations. These patterns are robust in the presence of errors, which is good news for the applications. The parallel coordinates methodology has been applied to collision avoidance and conflict resolution algorithms for air traffic control (3 USA patents), computer vision (USA patent), data mining (USA patent) and elsewhere.
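The transformation of multivariate relations into 2-D patterns rests on the point-line duality of parallel coordinates: with two vertical axes at horizontal positions 0 and 1, a data point (x1, x2) becomes a line segment, and all points on a line x2 = m*x1 + c map to segments through one common point at (1/(1-m), c/(1-m)) for m != 1. A small numeric sketch of this standard fact (function names are ours):

```python
def pc_segment(point):
    """Map a 2-D data point (x1, x2) to its parallel-coordinates polyline:
    the segment from (0, x1) to (1, x2)."""
    x1, x2 = point
    return (0.0, x1), (1.0, x2)

def segment_value(seg, t):
    """Height of the polyline at horizontal position t (linear interpolation)."""
    (t0, y0), (t1, y1) = seg
    return y0 + (t - t0) * (y1 - y0) / (t1 - t0)

# Duality: every point on x2 = m*x1 + c yields a segment passing through
# the single dual point (1/(1-m), c/(1-m)); a linear relation between two
# variables therefore shows up visually as a pencil of lines through a point.
```

This is the 2-D pattern one learns to recognize when hunting for linear relations between adjacent axes.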

How uncertainty simultaneously drives and hinders knowledge discovery from health data

Pedro Pereira Rodriguez

University of Porto, Porto, Portugal. [email protected]

Abstract.

Keywords

On A Decision-Maker Without Preferences

Andreas Geyer-Schulz1

Karlsruhe Institute of Technology [email protected]

Abstract. A decision-maker without preferences is a decision-maker who chooses an element of a choice set with equal probability. The problem is trivial if the choice set is known a priori. However, if the choice set (and its size n) is not known, we construct an (infinite) series of probability spaces. We derive two models of a decision-maker without preferences: the first model considers the probability of making r co-purchases of at least one element with a certain element i. For the second model, we compute the probability of observing the complete histogram (partition) of all co-purchases with element i. The implication is that complementarities between elements exist and that they are contained in the partition. In the second model we study the probability distribution of potential choice variants of k items out of n. We observe that, depending on n, rank reversals of choice variants occur, although the decision-maker acts completely rationally (for small n). For large n, the order of the choice variants becomes stable and no further anomalies occur. We link this to violations of the axiom of independence of irrelevant alternatives in decision theory. In addition, we refer to research in marketing on the way consumer choices are modelled by a subsequent restriction of the choice set and, e.g., the effect of branding on the human brain.
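The null model underlying such an analysis is elementary: a decision-maker without preferences who picks a k-subset of n items uniformly at random includes any fixed item i with probability C(n-1, k-1)/C(n, k) = k/n. A one-line sketch of that baseline (illustrative of the null model only, not the talk's full co-purchase construction):

```python
import math

def choice_prob(n, k):
    """Probability that a fixed item i appears in a uniformly random
    k-subset of n items: C(n-1, k-1) / C(n, k), which simplifies to k/n."""
    return math.comb(n - 1, k - 1) / math.comb(n, k)
```

Deviations of observed co-purchase frequencies from this baseline are what signal complementarities between elements.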

Keywords

PREFERENCE LEARNING, RANK REVERSAL, INDEPENDENCE OF IRRELEVANT ALTERNATIVES, DECISION THEORY

Partially supervised learning algorithms in classification and clustering

Friedhelm Schwenker1

Ulm University, Institute of Neural Information Processing, D-89069 Ulm [email protected]

Abstract. In practical machine learning applications, e.g. in classifier design, the annotation of the collected data is an expensive and error-prone pre-processing step. In order to apply supervised learning approaches, each sample of the data set must be carefully inspected and labeled by an expert, or even by a team of experts. Thus, in complex real-world applications, such as affective computing or medical diagnosis, the result of the annotation process is a partially labelled data set: some instances have a crisp label, other samples remain unlabeled, and another part of the data might be labeled by fuzzy memberships. In partially supervised learning (PSL) the goal is to develop new algorithms for such problems, and to investigate settings where PSL algorithms can be applied successfully. In this contribution we discuss different types of PSL approaches, namely active learning, semi-supervised learning in classification and clustering, and learning from fuzzy labels; furthermore, we present results of PSL algorithms achieved for pattern recognition problems in human-computer interaction scenarios.
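One common semi-supervised strategy in this family is self-training: fit a simple classifier on the labelled part, pseudo-label the most confident unlabelled points, and refit. The sketch below uses a nearest-class-mean classifier for two classes; it illustrates the idea only and is not one of the specific PSL algorithms of the talk.

```python
import numpy as np

def self_training(X_lab, y_lab, X_unlab, rounds=5):
    """Self-training with a nearest-class-mean classifier (two classes):
    in each round the most confidently classified unlabelled points are
    pseudo-labelled and added to the labelled set."""
    X, y = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()
    for _ in range(rounds):
        if len(pool) == 0:
            break
        means = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
        classes = sorted(means)
        d = np.stack([np.linalg.norm(pool - means[c], axis=1) for c in classes], axis=1)
        pred = np.array(classes)[d.argmin(axis=1)]
        conf = np.abs(d[:, 0] - d[:, 1])                    # two-class margin
        take = conf.argsort()[-max(1, len(pool) // rounds):]  # most confident
        X = np.vstack([X, pool[take]])
        y = np.concatenate([y, pred[take]])
        pool = np.delete(pool, take, axis=0)
    return X, y
```

With only a handful of crisp labels, the unlabelled data sharpens the class means considerably when the cluster assumption holds.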

References

[1] Schwenker, F. and Trentin, E. (2013): Pattern classification and clustering: A review of partially supervised learning approaches. Pattern Recognition Letters, 37, 4–14.
[2] Zhu, X. (2008): Semi-supervised learning literature survey. Technical report 1530, Department of Computer Sciences, University of Wisconsin, Madison.
[3] Settles, B. (2010): Active learning literature survey. Technical report 1648, Department of Computer Sciences, University of Wisconsin, Madison.
[4] Thiel, C., Scherer, S. and Schwenker, F. (2007): Fuzzy-Input Fuzzy-Output One-Against-All Support Vector Machines. In: Knowledge-Based Intelligent Information and Engineering Systems, Springer LNAI 4694, 156–165.

Keywords active learning, semi-supervised learning, learning from weak labels


Part III

Invited Sessions

3 Invited Session 1: High-dimensional data: using all dimensions

Organized by Christian Hennig

Wednesday, July 2, 2014: 14:00 - 16:05, West Hall 1

Linear Storage and Potentially Constant Time Hierarchical Clustering Using the Baire Metric and Random Spanning Paths

Fionn Murtagh1 and Pedro Contreras2

1 School of Computer Science and Informatics, De Montfort University, Leicester LE1 9BH, England [email protected]
2 Thinking Safe Ltd., Orchard Building, Royal Holloway, Egham TW20 0EX, England [email protected]

Abstract. We study how random projections can be used with very large data sets in order (i) to cluster the data using a fast, binning approach, characterized in terms of directly inducing a hierarchy through use of the Baire metric; and (ii) based on the clusters found, to select subsets of the original data for further analysis. In the current work, the latter is analysis of the clusters found, using the original data. Hence it is inter-cluster analysis, rather than intra-cluster. A random projection, outputting a random permutation of the observation set, provides a random spanning path. We show how a spanning path relates to contiguity- or adjacency-constrained clustering. We study performance properties of hierarchical clustering constructed from random spanning paths, and we introduce a novel visualization of the results.
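The Baire metric that makes the binning approach hierarchical is the longest-common-prefix distance: two values in [0,1) are at distance 2^-k when their digit expansions share k leading digits, so grouping by digit prefixes directly induces a hierarchy in one linear pass. A small sketch of this standard construction (the talk's full algorithm adds random projections on top):

```python
def baire_distance(x, y, precision=8):
    """Baire (longest-common-prefix) distance between two values in [0,1):
    2^-k, where k is the number of shared leading decimal digits."""
    sx = f"{x:.{precision}f}".split(".")[1]
    sy = f"{y:.{precision}f}".split(".")[1]
    k = 0
    while k < precision and sx[k] == sy[k]:
        k += 1
    return 2.0 ** (-k)

def baire_partition(values, level):
    """Bin values by their first `level` digits; successive levels refine
    each other, so levels 1..precision form a cluster hierarchy directly."""
    bins = {}
    for v in values:
        key = f"{v:.8f}".split(".")[1][:level]
        bins.setdefault(key, []).append(v)
    return bins
```

Because binning is a single pass over the data, the induced hierarchy costs linear time and storage, which is the point of the title.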

References

BOUTSIDIS, C., ZOUZIAS, A. and DRINEAS, P. (2010): Random Projections for k-Means Clustering. In: Advances in Neural Information Processing Systems 23, 298–306.
CONTRERAS, P. and MURTAGH, F. (2012): Fast, Linear Time Hierarchical Clustering using the Baire Metric. Journal of Classification, 29, 118–143.
FERN, X.Z. and BRODLEY, C.E. (2003): Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach. In: T. Fawcett and N. Mishra (Eds.), Proceedings of the Twentieth International Conference on Machine Learning, 186–193.

Keywords

HIERARCHICAL CLUSTERING, BAIRE METRIC/ULTRAMETRIC, MASSIVE HIGH DIMENSIONAL DATA, RANDOM PROJECTION, SPANNING PATH

Clustering for high-dimension, low-sample-size data using distance vectors

Yoshikazu Terada

Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka 560-8531, Japan [email protected]

Abstract. In high-dimension, low-sample-size (HDLSS) data, it is not always true that (Euclidean) closeness of two objects reflects a hidden cluster structure. Thus, we need an appropriate distance measure between clusters for HDLSS data. Ahn et al. (2013) proposed a clustering method based on the maximal data piling (MDP) distance (Ahn and Marron, 2010), called MDP clustering. Under certain conditions, MDP clustering can detect the difference between the mean vectors of two clusters. However, the sufficient condition for the label consistency of MDP clustering depends on the sample sizes and variances of the two clusters, while MDP clustering only focuses on the difference between the mean vectors of the two clusters. In this study, we point out the important fact that it is not the closeness, but the “values” of the Euclidean distances that contain information regarding the cluster structure in HDLSS data. Based on this fact, we propose an efficient and simple clustering approach, called distance vector clustering, for HDLSS data. Distance vector clustering can detect not only the differences between mean vectors of clusters but also the differences between the variances of clusters. Under the assumptions given in the work of Hall et al. (2005), we show that the proposed approach gives the true cluster labels under milder conditions.
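The core observation can be illustrated directly: represent each object by its vector of distances to all objects and cluster those vectors, since in HDLSS data the distance values concentrate around cluster-specific levels even when raw closeness is uninformative. The sketch below (threshold initialisation plus a 2-means refinement) is our illustration of that idea; Terada's actual procedure differs in detail.

```python
import numpy as np

def distance_vector_clustering(X, iters=50):
    """Two-cluster sketch: cluster the rows of the pairwise distance matrix
    rather than the raw coordinates."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # n x n distances
    labels = (D[0] > np.median(D[0])).astype(int)  # init: near vs far from object 0
    for _ in range(iters):
        centers = np.stack([D[labels == k].mean(axis=0) for k in (0, 1)])
        new = np.argmin(np.linalg.norm(D[:, None, :] - centers[None], axis=2), axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels
```

With, say, 20 observations in 100 dimensions, the within- and between-cluster distance values separate cleanly even though every pair of points is "far" in the Euclidean sense.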

References

AHN, J., LEE, M. H., and YOON, Y. J. (2013): Clustering high dimension, low sample size data using the maximal data piling distance. Statistica Sinica, 22, 443–464. AHN, J. and MARRON, J. S. (2010): The maximal data piling direction for discrimination. Biometrika, 97, 254–259. HALL, P., MARRON, J. S., and NEEMAN, A. (2005): Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society: Series B, 67, 427–444.

Keywords

PERFECT CLUSTERING, HDLSS DATA, DISTANCE VECTOR

A pattern-clustering method for longitudinal data - heroin users receiving methadone

Chien-Ju Lin and Christian Hennig

University College London, England {chien-ju.lin.10, c.hennig}@ucl.ac.uk

Abstract. Methadone is used as a substitute for heroin, and there may be certain groups of users according to methadone dosage. In this work we analyze data for 314 participants of a methadone study over 180 days. The data consist of seven categories, of which six have an ordinal scale representing dosages and one stands for missing dosages. We develop a clustering method involving the so-called p-dissimilarity, partitioning around medoids (PAM) (Kaufman and Rousseeuw, 1990), and a null model test. The p-dissimilarity is used to measure dissimilarity between the 180-day time series of the participants. It accommodates ordinal and categorical scales by using a parameter p as a switch between data being treated as categorical and ordinal. Moreover, we construct a Markov null model without cluster structure, in which the distributions of the categories are the same as those of the real data. The null model test uses the null model and the parametric bootstrap to investigate whether the clusters found by PAM and the value of the Average Silhouette Width (Kaufman and Rousseeuw, 1990) can be explained by random variation. Despite the fact that no significant clustering structure is observed, the sequences of categories for clusters are useful for clinicians to prescribe a proper dosage to increase the efficiency of methadone maintenance therapy.

References

KAUFMAN, L. and ROUSSEEUW, P.J. (1990): Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons.

Keywords

P-DISSIMILARITY, NULL MODEL TEST

Quantile-based classifiers

Christian Hennig1 and Cinzia Viroli2

1 UCL, Department of Statistical Science, Gower St., London WC1E 6BT, United Kingdom [email protected] 2 University of Bologna, Department of Statistical Sciences, Via Belle Arti 41, 40126 Bologna, Italy [email protected]

Abstract. Quantile-based classifiers are generalisations of the median-based classifiers recently introduced by Hall et al. (2009). They work for potentially high-dimensional data, and are defined by classifying an observation according to a sum of appropriately weighted component-wise distances of the components of the observation to the within-class quantiles. The optimal quantiles can be chosen by minimizing the misclassification error in the training sample. I will present some theory and simulation results demonstrating that quantile classifiers are very competitive. Quantile classifiers will also be applied to the detection of bioaerosol particles based on gaseous plasma electrochemistry (Sarantaridis et al., 2012).
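The definition above can be sketched directly: store the componentwise theta-quantile of each class and classify by the smallest sum of asymmetrically weighted componentwise distances (the asymmetric "check" loss whose minimiser is the quantile). Function names and the two-class test setup are ours; for theta = 0.5 this reduces to the median-based classifier of Hall et al.

```python
import numpy as np

def quantile_classifier_fit(X, y, theta):
    """Store the componentwise theta-quantile of each class."""
    return {c: np.quantile(X[y == c], theta, axis=0) for c in np.unique(y)}

def quantile_classifier_predict(q, X, theta):
    """Assign each row of X to the class minimising the summed asymmetric
    quantile distance: positive deviations weighted by theta, negative
    ones by (1 - theta)."""
    classes = sorted(q)
    scores = []
    for c in classes:
        diff = X - q[c]
        loss = np.where(diff > 0, theta * diff, (theta - 1) * diff)
        scores.append(loss.sum(axis=1))
    return np.array(classes)[np.argmin(np.stack(scores, axis=1), axis=1)]
```

Tuning theta on the training misclassification error, as the abstract describes, is then a one-dimensional grid search.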

References

HALL, P., TITTERINGTON, D. M. and XUE, J.-H. (2009): Median-based classifiers for high-dimensional data. Journal of the American Statistical Association, 104, 1597–1608.
HENNIG, C. and VIROLI, C. (2013): Quantile-based classifiers. http://arxiv.org/abs/1303.1282
SARANTARIDIS, D., HENNIG, C. and CARUANA, D. J. (2012): Bioaerosol detection using potentiometric tomography in flames. Chemical Science, 3, 2210–2216.

Keywords

MEDIAN-BASED CLASSIFIER, HIGH-DIMENSIONAL DATA

4 Invited Session 2: Recent Advances in Mixture Modelling

Organized by Roberto Rocci

Thursday, July 3, 2014: 10:30 - 12:35, West Hall 1

Clustering heteroscedastic data

Gunter Ritter1

University of Passau [email protected]

Abstract. The heteroscedastic normal model was known to Karl Pearson more than 100 years ago. Nevertheless, using this model for the purpose of estimation has been a pain for a long time. The seminal paper of Kiefer and Wolfowitz (1956) contains the simple argument which shows that the mixture likelihood is unbounded. Hence, an MLE does not exist. Peters and Walker [5] and Kiefer [4] proved that there is a consistent local MLE. However, it had already been known to Day [1] that there are plenty of local likelihood maxima. Often there are too many to allow safe estimation of a heteroscedastic normal model. This has prompted practitioners to design their own cluster algorithms. It has prompted specialists to rely on scale constraints such as spherical or homoscedastic submodels, which show fewer local maxima. If the properties of the parent distribution comply with the assumptions underlying these models, then they allow safe estimation. Otherwise, the estimate may be grossly distorted. Hathaway [3] used scale constraints that guarantee an MLE to exist. Moreover, when they are satisfied by the parent distribution, they render the MLE asymptotically consistent. Because of their theoretical importance they were named HDBT constraints by Gallegos and Ritter [2]. The shortcoming of Hathaway’s theorem is that it does not disclose the HDBT constraint. The talk will present a method to overcome this problem. It uses the HDBT plot introduced by Gallegos and Ritter [2], a plot of HDBT ratio vs. likelihood. The final method does not use any scale constraints. In complex situations, the maximum is still ambiguous. This phenomenon is inherent to mixture analysis. Classical inference is no longer valid here. Some myths of mixture analysis will also be addressed.
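The HDBT ratio mentioned above can be computed as the smallest eigenvalue of Sigma_j Sigma_l^{-1} over all ordered pairs of component covariance matrices; bounding it from below limits how unbalanced the component scales may become and keeps the mixture likelihood bounded. The sketch follows our reading of Gallegos and Ritter (2009) and should be treated as illustrative.

```python
import numpy as np

def hdbt_ratio(covs):
    """Smallest eigenvalue of Sigma_j @ inv(Sigma_l) over all ordered pairs
    j != l of component covariance matrices (equal to 1 iff all components
    share the same covariance, i.e. the homoscedastic case)."""
    r = np.inf
    for j, Sj in enumerate(covs):
        for l, Sl in enumerate(covs):
            if j != l:
                vals = np.linalg.eigvals(Sj @ np.linalg.inv(Sl))
                r = min(r, float(np.min(vals.real)))  # real for SPD inputs
    return r
```

Plotting this ratio against the local-maximum likelihood values is, per the abstract, what the HDBT plot does.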

References

[1] N.E. Day. Estimating the components of a mixture of normal distributions. Biometrika, 56:463–474, 1969.
[2] Maria Teresa Gallegos and Gunter Ritter. Trimmed ML estimation of contaminated mixtures. Sankhya, Series A, 71:164–220, 2009.
[3] Richard J. Hathaway. A constrained formulation of maximum likelihood estimation for normal mixture distributions. Ann. Statist., 13:795–800, 1985.
[4] N.M. Kiefer. Discrete parameter variation: efficient estimation of a switching regression model. Econometrica, 46:427–434, 1978.


[5] B. Charles Peters, Jr. and Homer F. Walker. An iterative procedure for obtaining maximum-likelihood estimates of the parameters for a mixture of normal distributions. SIAM J. Appl. Math., 35:362–378, 1978.

Keywords mixture normal distributions, HDBT constraints, local likelihood maxima

Diagnostics for model-based clustering via mixture models with covariates

Salvatore Ingrassia and Antonio Punzo1

Department of Economics and Business, University of Catania, Corso Italia 55, 95129 Catania, Italy [s.ingrassia, antonio.punzo]@unict.it

Abstract. Mixture models with covariates - which include mixtures of regressions, mixtures of regressions with concomitant variables and cluster-weighted models - are flexible statistical methods for clustering heterogeneous populations based on within-group relationships between a response variable and a set of covariates. In this paper we introduce some diagnostic indices and graphical tools for model evaluation and selection. In the framework of model-based clustering, we also investigate geometrical features of the decision surfaces. Case studies based on both simulated and real data are presented.

References

INGRASSIA, S., MINOTTI, S.C. and VITTADINI, G. (2012): Local Statistical Modeling via a Cluster-Weighted Approach with Elliptical Distributions. Journal of Classification, 29(3), 363–401.
INGRASSIA, S., MINOTTI, S.C. and PUNZO, A. (2014): Model-based clustering via linear cluster-weighted models. Computational Statistics and Data Analysis, 71, 159–182.

Keywords

MIXTURES OF REGRESSIONS, CLUSTER-WEIGHTED MODELS, MODEL-BASED CLUSTERING

Mixture models for ordinal data: a pairwise likelihood approach

Monia Ranalli1 and Roberto Rocci2

1 Department of Statistics, Sapienza University of Rome, [email protected] 2 IGF Department, University of Tor Vergata, Rome, [email protected]

Abstract. A latent Gaussian mixture model to classify ordinal data is investigated. The observed variables are considered as a discretization of an underlying finite mixture of Gaussians [1,3]. This means that the likelihood function involves a multidimensional integral, whose evaluation is computationally demanding as the number of observed variables increases. Thus the model estimation through a full maximum likelihood approach becomes prohibitive. Different solutions are possible to handle hard likelihood problems. The idea is to replace the likelihood with a surrogate objective function that is easier to maximize. Here, we use a pairwise likelihood [2], whose maximization is performed by an EM-like algorithm. In order to classify the objects, the joint posterior probabilities are approximated by running an Iterative Proportional Fitting algorithm based on the pairwise posterior probabilities. The effectiveness of the proposal is shown by conducting a simulation study in which the pairwise likelihood approach is compared with the full maximum likelihood and the maximum likelihood for continuous data ignoring the ordinal nature of the variables. Some real examples are illustrated.
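The computational point of a pairwise (composite) likelihood is that the full d-dimensional density is replaced by a sum over all bivariate margins, each of which needs only 2x2 linear algebra. The sketch below shows this for a plain Gaussian density as a stand-in for the ordinal model of the talk (whose margins would be bivariate normal probabilities over thresholds instead).

```python
import numpy as np

def pairwise_loglik(X, mu, Sigma):
    """Composite log-likelihood: sum of all bivariate marginal Gaussian
    log-densities, avoiding any d-dimensional determinant or inverse."""
    n, d = X.shape
    total = 0.0
    for j in range(d):
        for k in range(j + 1, d):
            idx = [j, k]
            S = Sigma[np.ix_(idx, idx)]                 # 2 x 2 marginal covariance
            diff = X[:, idx] - mu[idx]
            quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(S), diff)
            total += float(np.sum(-0.5 * (quad + np.log(np.linalg.det(S))
                                          + 2.0 * np.log(2.0 * np.pi))))
    return total
```

For d = 2 the pairwise and full log-likelihoods coincide, which gives a handy sanity check; for larger d the surrogate trades some statistical efficiency for tractability.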

References

[1] EVERITT, B. (1988): A finite mixture model for the clustering of mixed-mode data. Statistics & Probability Letters, 6(5), 305–309.
[2] LINDSAY, B. (1988): Composite likelihood methods. Contemporary Mathematics, 80, 221–239.
[3] LUBKE, G. and NEALE, M. (2008): Distinguishing between latent classes and continuous factors with categorical outcomes: Class invariance of parameters of factor mixture models. Multivariate Behavioral Research, 43(4), 592–620.

Keywords


5 Invited Session 3: Data analysis in human-computer interaction scenarios

Organized by Friedhelm Schwenker

Thursday, July 3, 2014: 10:30 - 12:35, West Hall 5

A Study on the Impact of Additional Modalities on Automatic Emotion Recognition

Jonghwa Kim

Institut für Informatik, Universität Augsburg, Universitätsstr. 6a, D-86159 Augsburg, Germany [email protected]

Abstract. The main objective of this study is to investigate the impact of additional modalities on the performance of emotion recognition using speech, facial expression and physiological measurements. In order to compare different approaches, we designed a feature-based recognition system as a benchmark tool which carries out linear supervised classification followed by leave-one-out cross-validation. As a result of the classification of four emotions, it turned out that the bimodal approach always improves the recognition accuracy of the unimodal approach, while the performance of the trimodal approach varies strongly depending on the individual. Furthermore, we observed extremely high disparity between single-class recognition rates, while we could not identify a best performing single modality in our experiment. Based on these observations, we developed a novel fusion method, called parametric ensemble decision fusion (PEDEF), which builds emotion-specific ensembles and exploits the advantages of a parametrized decision process.

References

Polikar, R. (2006): Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3), 21–45.
Kim, J. and André, E. (2008): Emotion recognition based on physiological changes in music listening. IEEE Trans. Pattern Anal. and Machine Intell., 30(12), 2067–2083.
Kim, J. and Lingenfelser, F. (2010): Ensemble approaches to parametric decision fusion for bimodal emotion recognition. In: Proc. of Biosignals 2010, Int. Conf. on Bio-inspired Systems and Signal Processing, 460–463.

Keywords

Emotion Recognition, Affective Computing, Biosignals, Multisensory Data Fusion

Analyzing and labeling multimodal data in human-computer interaction using ATLAS

Sascha Meudt and Friedhelm Schwenker

Ulm University, Institute of Neural Information Processing, D-89069 Ulm {sascha.meudt | friedhelm.schwenker}@uni-ulm.de

Abstract. Human-computer interaction (HCI) is a process where information is transferred between the two interacting parties; thus in HCI we are dealing with the interaction and communication between humans and machines. In such tasks pattern recognition and machine learning technologies and methods are heavily involved, and thus data to train such classifiers has to be collected and annotated. In this talk we present ATLAS [1], a graphical tool for the annotation of multi-modal data streams collected in HCI scenarios [2]. In our scenarios, besides multi-channel audio and video inputs, various bio-physiological data are recorded, ranging from multi-variate data such as ECG, EEG and EMG to simple uni-variate signals such as skin conductivity, respiration and blood volume pulse. In addition to raw data, data pre-processing results (extracted features), or even outputs of pre-trained classifiers, can be displayed in ATLAS. Tools for semi-automatic data transcription are integrated as well. The performance of ATLAS is presented on Wizard-of-Oz data [3].

References

[1] Meudt, S., Bigalke, L. and Schwenker, F. (2012): ATLAS - Annotation tool using partially supervised learning and multi-view co-learning in human-computer-interaction scenarios. In: Proc. of the ISSPA, 1309–1312.
[2] Strauß, P., Hoffmann, H., Minker, W., Neumann, H., Palm, G., Scherer, S., Schwenker, F., Traue, H., Walter, W. and Weidenbacher, U. (2006): Wizard-of-Oz data collection for perception and interaction in multi-user environments. In: Proc. of the LREC.
[3] Schels, M., Glodek, M., Meudt, S., Schmidt, M., Hrabal, D., Böck, R., Walter, S. and Schwenker, F. (2012): Multi-modal classifier-fusion for the classification of emotional states in WOZ scenarios. In: Proc. of the AHFE, 5337–5346.

Keywords human-computer interaction, data labeling, partially supervised learning

Automated Pain Recognition System on the Basis of Biopotentials and Video Recording

Steffen Walter1, Sascha Gruss1, Junwen Tan1, Kerstin Limbrecht-Ecklundt1, Harald C. Traue1, Philipp Werner2, and Ayoub Al-Hamadi2

1 Department Psychosomatic Medicine and Psychotherapy, Ulm University 2 Institute for Electronic, Signal Processing and Communication, University of Magdeburg

Abstract. Background and aims: The objective registration of subjective, multi-dimensionally experienced pain is still a problem that has not been adequately solved. Building on the experience gained to date, the aim is to advance an automated pain recognition system using biopotentials and visual data. Methods: For this purpose, we elicited painful heat stimuli in 90 participants under controlled conditions; biopotentials and video features were used to measure the responses. Research questions: What kind of features and feature patterns are most relevant to robust pain recognition? Does data fusion yield a significant improvement compared with separate biopotential or video signal analyses? Results: The features of (1) electromyography corrugator peak to peak, (2) corrugator Shannon entropy, and (3) heart rate RR slope were chosen as the most selective. It was shown that the automatic recognition rates of the data fusion are significantly superior compared with separate biopotential or video signal analyses. In particular, the detection rates were significantly improved with feature selection methods. Conclusions: All in all, we are advancing towards our vision of an automatic system for an objective measurement of pain which will facilitate pain monitoring, logging and support in a clinical environment.

References

[1]Treister, R., Kliger, M., Zuckerman, G., Goor Aryeh, I., and Eisenberg, E. (2012): Dif- ferentiating between heat pain intensities: the combined effect of multiple autonomic parameters. Pain, 153(9), 1807-1814.

Keywords pain quantification; heat; biopotentials

6 Invited Session 4: Health technology assessment of community interventions for active healthy aging

Organized by Helmut Hildebrandt and Berthold Lausen

Thursday, July 3, 2014: 14:35 - 16:15, West Hall 1

Business intelligence in the context of integrated care systems (ICS): experiences from the ICS “Gesundes Kinzigtal” in Germany

Alexander Pimperl1, Timo Schulte2, and Helmut Hildebrandt3

1 OptiMedis AG, Borsteler Chaussee 53, 22453 Hamburg, Germany [email protected]
2 OptiMedis AG, [email protected]
3 OptiMedis AG, [email protected]

Abstract. Patients generate various data with every contact with the health care system. In integrated care systems (ICS) these fragmented patient data sets of the various health care players can be connected. Business intelligence (BI) technologies are seen as valuable tools to gain insights and value from these huge volumes of data. However, so far there is only sparse experience with BI in the integrated healthcare network context. Therefore the aim of this article is to describe how a BI solution can be implemented in practice in an ICS and what challenges have to be met. By the example of a BI best practice model - the ICS Gesundes Kinzigtal - it will be shown that data from various data sources can be linked in a data warehouse, prepared, enriched and used for management support via a BI front-end: starting with the project preparation and development, via the ongoing project management, up to a final evaluation. Benefits for patients, care providers, the ICS management company and sickness funds will be characterised, as well as the most crucial lessons learned.

References

[1] HILDEBRANDT, H., SCHULTE, T. and STUNDER, B. (2012): Triple Aim in Kinzigtal. Improving population health, integrating health care and reducing costs of care - lessons for the UK? Journal of Integrated Care, 20(4), 205–222.
[2] VIJAYARAGHAVAN, V. (2011): Disruptive Innovation in Integrated Care Delivery Systems. Available at: www.christenseninstitute.org/wp-content/uploads/2013/04/Disruptive-innovation-in-integrated-care-delivery-systems.pdf [downloaded on 2014-03-31].
[3] BARC (2013): Merck und OptiMedis gewinnen den BARC Best Practice Award 2013. Available at: barc.de/news/merck-und-optimedis-gewinnen-den-barc-best-practice-award-2013 [downloaded on 2014-04-25].

Keywords

BUSINESS INTELLIGENCE, INTEGRATED CARE SYSTEM, PERFORMANCE MANAGEMENT, BALANCED SCORECARD

Sample size considerations for primary prevention studies with anchoring vignettes

Stavros Poupakis1, Adi Florea1,2, Hongsheng Dai1, Helge Gillmeister2, Peter Lynn3, Aris Perperoglou1, and Berthold Lausen1

1 Dept. of Mathematical Sciences, University of Essex [email protected] 2 Dept. of Psychology, University of Essex 3 Institute of Social and Economic Research, University of Essex

Abstract. The aim of this study is to assess the efficacy and effectiveness of a primary prevention programme by running a cluster randomised controlled trial with 40 communities (clusters) in six countries (sites). The communities are randomly allocated to treatments (intervention or control) regarding health and social care provision, using pairwise matching for similar rural or urban characteristics. We analyse how sample size arguments based on multilevel random effects can be used to reduce costs, i.e. the number of participants needed per cluster. Stratification for age groups (50 to 60 years; 60 to 70 years) and gender will also be considered. Using data from SHARE (www.share-project.org), we further investigate whether anchoring vignettes can be employed to reduce the sample size. This methodology has become a popular technique in surveys to deal with unobserved heterogeneity in response scales (King et al. 2004). The focus will be on the before-and-after difference of quality-of-life measures such as WHOQOL-BREF and CASP-12 (Wiggins et al. 2008; Howel 2012).
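The cost argument hinges on the design effect of cluster randomisation: with m participants per cluster and intra-cluster correlation rho, each arm needs deff = 1 + (m - 1) * rho times the individually randomised sample size. The sketch below is the textbook two-arm formula, not the study's multilevel calculation; default z-values correspond to 5% two-sided significance and 80% power.

```python
import math

def clusters_needed(delta, sigma, m, icc, z_alpha=1.96, z_power=0.84):
    """Clusters per arm for detecting a mean difference `delta` with outcome
    standard deviation `sigma`, cluster size m and intra-cluster correlation
    `icc`, via the design-effect inflation of the standard two-sample formula."""
    deff = 1 + (m - 1) * icc                                  # design effect
    n_per_arm = 2 * (z_alpha + z_power) ** 2 * (sigma / delta) ** 2 * deff
    return math.ceil(n_per_arm / m)                           # whole clusters
```

Lowering the residual variance sigma, which is what anchoring vignettes aim to do by removing response-scale heterogeneity, feeds directly through the (sigma/delta)^2 term.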

References

KING, G., MURRAY, C.J.L., SALOMON, J.A. and TANDON, A. (2004): Enhancing the validity and cross-cultural comparability of measurement in survey research. American Political Science Review, 98(1), 191–207.
WIGGINS, R.D., NETUVELI, G., HYDE, M., HIGGS, P. and BLANE, D. (2008): The evaluation of a self-enumerated scale of quality of life (CASP-19) in the context of research on ageing: A combination of exploratory and confirmatory approaches. Social Indicators Research, 89(1), 61–77.
HOWEL, D. (2012): Interpreting and evaluating the CASP-19 quality of life measure in older people. Age and Ageing, 41(5), 612–617.

Keywords

QUALITY OF LIFE, ANCHORING VIGNETTES, SHARE

Testing Lifestyle Theories with Different Data Analyses: Modelling Multiple Behavior Change in Behavioral and Health Care Sciences

Sonia Lippke1, Lena Fleig2, Amelie U. Wiedemann2, and Ralf Schwarzer2

1 Jacobs University Bremen, Campus Ring 1, 28759 Bremen [email protected]
2 Freie Universität Berlin, Habelschwerdter Allee 45, 14195 Berlin [email protected] / [email protected] / [email protected]

Abstract. Purpose: The Compensatory Carry-over Action Model (CCAM) theorizes social-cognitive factors facilitating different behaviors. Mechanisms are assumed within one behavior domain (e.g., for physical activity or nutrition). Additionally, the CCAM assumes that compensatory and carry-over mechanisms work between behaviors. The purpose of the current study was to test these assumptions regarding single and multiple behavior change with appropriate data analyses. Methods: N=384 employees of a logistics company completed questionnaires regarding their nutrition and physical activity and their predictors twice, with a time lag of 4 weeks. Data analyses were performed with regression analyses, cross-lagged models, structural equation modelling and multi-group modeling, applying SPSS and AMOS. Results: The data fit the theoretical structure of the CCAM satisfactorily with RMSEA=.059, and assumptions of the CCAM were mainly confirmed: within behaviors, intentions were translated into behavior via planning. Plans for nutrition and for physical activity were moderately interrelated (r=.37), indicating carry-over mechanisms. Whereas intentions for physical activity seem to influence planning for nutrition, and plans for physical activity appear to influence nutrition behavior, this was not found correspondingly for physical activity being affected by nutrition predictors. Conclusions: These results support assumptions on cross-behavior mechanisms and are in accordance with previous findings indicating physical activity as a gateway behavior. They are important for designing behavior change interventions and for modelling multiple behavior change with appropriate data analyses: future studies should apply such data analyses, too, to better understand lifestyle factors and how to improve them in an evidence- and theory-based way.

Keywords

REGRESSION ANALYSES, CROSS-LAGGED MODELS, STRUCTURAL EQUATION MODELLING, MULTI-GROUP MODELING

7 Invited Session 5: SVM Large Scale Learning

Organized by Claus Weihs

Friday, July 4, 2014: 08:30 - 10:35, West Hall 1

Linear SVM Training with Online Adaptation of Coordinate Frequencies

Tobias Glasmachers

Institut für Neuroinformatik, Ruhr-Universität Bochum [email protected]

Abstract. Coordinate descent (CD) algorithms have become the method of choice for solving a number of optimization problems in machine learning. They are particularly popular for training linear models, including linear support vector machine classification. We consider general CD with non-uniform selection of coordinates. Instead of fixing selection frequencies beforehand we propose an online adaptation mechanism for this parameter, called the adaptive coordinate frequencies (ACF) method. This mechanism removes the need to estimate optimal coordinate frequencies beforehand, and it automatically reacts to changing requirements during an optimization run. We demonstrate the usefulness of our ACF-CD approach for a variety of optimization problems arising in large scale machine learning contexts. Our algorithm offers significant speed-ups over state-of-the-art training methods.
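The adaptation idea can be sketched in a few lines. This is an illustrative toy version, not the paper's actual ACF update rule: each coordinate carries a preference weight that is raised when its update makes visible progress and lowered otherwise, and coordinates are sampled in proportion to these weights. The quadratic objective, step size and adaptation constants in the demo are all our own assumptions.

```python
import random

def acf_coordinate_descent(grad_i, step, x, n_iters=2000, c=1.2, seed=0):
    """Coordinate descent with adaptively weighted coordinate selection
    (illustrative sketch). Coordinates whose updates make large progress
    get their selection preference increased, others decreased, so
    'useful' coordinates are visited more often."""
    rng = random.Random(seed)
    d = len(x)
    p = [1.0] * d                       # relative selection preferences
    for _ in range(n_iters):
        # sample a coordinate proportional to its current preference
        r, i, acc = rng.random() * sum(p), 0, p[0]
        while acc < r:
            i += 1
            acc += p[i]
        delta = -step * grad_i(x, i)    # plain coordinate gradient step
        x[i] += delta
        # online adaptation: reward coordinates with visible progress
        p[i] = min(10.0, max(0.1, p[i] * (c if abs(delta) > 1e-3 else 1.0 / c)))
    return x
```

Minimizing 0.5*(10*x0^2 + x1^2) this way drives both coordinates toward zero while the steeper coordinate is favored early in the run.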

Keywords linear SVM training, coordinate descent, adaptive coordinate selection

Stochastic gradient algorithms for solving large-scale learning problems

Sangkyun Lee

Computer Science Department (LS VIII), Collaborative Research Center SFB 876, TU Dortmund, Germany [email protected]

Abstract. Stochastic gradient algorithms have been applied successfully for solving many learning problems, including support vector machines, ordinary/logistic regression, deep learning, and so on. These methods are characterised by low per-iteration computation complexity and by the fact that learning can be performed incrementally in online settings: these properties make stochastic gradient algorithms well-suited for learning with big data. Also, regularizers can be easily incorporated into the objective function in these methods, so that a certain structure (e.g. sparsity or group sparsity) will be induced in solutions if desired. In this talk we will discuss general ideas, properties, recent developments, and applications of stochastic gradient algorithms.
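As a concrete illustration of the low per-iteration cost and the ease of adding a regularizer, here is a sketch of stochastic proximal gradient descent for L1-regularized least squares. All names and constants below are our own assumptions, not from the talk:

```python
import random

def prox_sgd(data, lam=0.1, step0=0.5, epochs=50, seed=1):
    """Stochastic proximal gradient for L1-regularized least squares
    (illustrative sketch): one (x, y) example per step keeps per-iteration
    cost low, and the soft-threshold step induces sparsity in w."""
    rng = random.Random(seed)
    d = len(data[0][0])
    w = [0.0] * d
    t = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            t += 1
            eta = step0 / (1.0 + 0.01 * t)            # decaying step size
            g = sum(wi * xi for wi, xi in zip(w, x)) - y   # residual
            w = [wi - eta * g * xi for wi, xi in zip(w, x)]
            # proximal step: soft-threshold each weight by eta * lam
            w = [(abs(wi) - eta * lam) * (1 if wi > 0 else -1)
                 if abs(wi) > eta * lam else 0.0 for wi in w]
    return w
```

On data where only the first feature matters, the soft-threshold step shrinks the irrelevant weight to (near) zero, illustrating the induced sparsity.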

Keywords regularizers, stochastic gradient algorithms, support vector machine

A Comparative Study of Kernelized Support Vector Machines

Daniel Horn1, Aydın Demircioğlu2, Bernd Bischl1, Tobias Glasmachers2, and Claus Weihs1

1 Fakultät Statistik, Technische Universität Dortmund, 44221 Dortmund {bischl, daniel.horn, weihs}@statistik.tu-dortmund.de 2 Institut für Neuroinformatik, Ruhr-Universität Bochum, 44780 Bochum, Germany {aydin.demircioglu, tobias.glasmachers}@ini.rub.de

Abstract. Kernelized support vector machines (SVMs) belong to the most widely used classification methods. However, in contrast to linear SVMs, the computation time required to train such a machine becomes a bottleneck when facing large datasets, since a quadratic programming problem has to be solved for each model. As soon as the dataset contains hundreds of thousands or even millions of samples, a single training run can take hours or days. In order to mitigate this shortcoming of kernel SVMs, many approximate training algorithms have been developed. While most of these methods claim to be much faster than the state-of-the-art solver LIBSVM, a thorough comparative study is missing. Our contribution is to fill this gap. We choose several well-known approximate SVM solvers and compare their performance to LIBSVM on a number of large benchmark datasets. Our special focus is to analyze the trade-off between prediction error and runtime for different learning and accuracy parameter settings. Unsurprisingly, given more runtime most solvers were able to find more accurate solutions, i.e., achieve a higher prediction accuracy. To analyze this trade-off, we apply model-based multicriteria optimization to compute the Pareto front of the two objectives, classification error and training time.
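The dominance filter behind such a Pareto front is simple to state. The sketch below only shows how non-dominated (error, runtime) configurations are extracted; it does not reproduce the paper's model-based multicriteria optimization:

```python
def pareto_front(points):
    """Return the non-dominated (error, time) pairs, both minimized:
    a point is dominated if another point is no worse in both objectives
    and strictly better in at least one."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)
```

For hypothetical solver runs measured as (classification error, training seconds), only the configurations on the error/time frontier survive the filter.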

References

1. CHANG, C.-C. and LIN, C.-J. (2011): LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Keywords

SUPPORT VECTOR MACHINES, KERNEL, MACHINE LEARNING

Support Vector Machines for Active Learning

Jan Kremer, Kim Steenstrup Pedersen, and Christian Igel

Department of Computer Science, University of Copenhagen, Universitetsparken 5, 2100 København Ø, Denmark [email protected]

Abstract. Active learning algorithms autonomously select the data points to learn from. There are many applications in which plenty of unlabeled data are available, but labels are costly to obtain, for example, because labeling involves human annotations or results from complex experiments. In such a case, an active learning algorithm tries to identify data points that, if labeled and used for training, would most improve the learned model. Obtaining labels only for the most promising data points speeds up learning and reduces labeling costs. Support vector machines (SVMs) are particularly well-suited for active learning. They perform linear classification, typically in a kernel-induced feature space. Thus, the distance of a data point from the decision boundary can be easily computed. In addition, heuristics can efficiently estimate how strongly learning from a data point would change the current SVM model. Based on this information, the learning algorithm can actively select training samples. We give a brief introduction to the active learning problem, discuss different strategies for selecting informative data points, and demonstrate how these strategies lead to different types of actively learning SVMs.
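The distance-to-boundary heuristic described above can be sketched as follows, assuming a linear model given by a weight vector w and offset b (hypothetical names; the query strategies discussed in the talk are more varied):

```python
def query_most_uncertain(w, b, unlabeled):
    """Uncertainty sampling for a linear SVM (sketch): pick the unlabeled
    point with the smallest distance |w.x + b| / ||w|| to the decision
    boundary, i.e. the point the current model is least certain about."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return min(unlabeled,
               key=lambda x: abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm)
```

The selected point would then be labeled by the oracle and added to the training set before the model is retrained.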

Keywords

Active learning, selecting informative data points, support vector machine

8 Invited Session 6: Predictions with Classification Models

Organized by Jozef Pociecha

Friday, July 4, 2014: 11:00 - 12:40, West Hall 1

Dynamic Aspects of Bankruptcy Prediction Models

Józef Pociecha1, Barbara Pawełek2 and Mateusz Baryła3

1 Cracow University of Economics, 27 Rakowicka Street, 31-510 Cracow, Poland [email protected] 2 Cracow University of Economics, 27 Rakowicka Street, 31-510 Cracow, Poland [email protected] 3 Cracow University of Economics, 27 Rakowicka Street, 31-510 Cracow, Poland [email protected]

Abstract. Many types of bankruptcy prediction models have been formulated in business theory and practice. Among them, a broad group consists of classification models, which divide the population of firms into two groups: bankrupts and non-bankrupts. Current bankruptcy prediction models are based exclusively on a company's internal financial factors, which usually have a static character. The aim of the paper is to present the possibility of introducing into bankruptcy prediction models a time factor which represents dynamic changes in the external economic environment. Some proposals for including a time factor in this type of model were tested on a data set of Polish manufacturing companies for the years 2005-2010.

References

ALTMAN, E.I. and HOTCHKISS, E. (2006): Corporate Financial Distress and Bankruptcy: Predict and Avoid Bankruptcy, Analyse and Invest in Distressed Debt. Wiley, Hoboken.
BELLOVARY, J.L., GIACOMINO, D.E. and AKERS, M.D. (2007): A Review of Bankruptcy Prediction Studies: 1930 to Present. Journal of Financial Education, 33 (4), 3-41.
PAWEŁEK, B. and POCIECHA, J. (2012): General SEM Model in Researching Corporate Bankruptcy and Business Cycles. In: J. Pociecha and R. Decker (Eds.): Data Analysis Methods and Its Applications. C.H. Beck, Warsaw, 215-232.

Keywords

CORPORATE BANKRUPTCY, BUSINESS CYCLES, CLASSIFICATION MODELS

Simple Random Sampling With Replacement as a Technique of Companies Selection in Corporate Bankruptcy Prediction

Mateusz Baryła1, Barbara Pawełek2 and Józef Pociecha3

1 Cracow University of Economics, 27 Rakowicka Street, 31-510 Cracow, Poland [email protected] 2 Cracow University of Economics, 27 Rakowicka Street, 31-510 Cracow, Poland [email protected] 3 Cracow University of Economics, 27 Rakowicka Street, 31-510 Cracow, Poland [email protected]

Abstract. Sample selection is one of the methodological issues in corporate failure prediction. In practice, the most popular approach is based on subjective pairings of available bankrupt companies with non-bankrupt firms. However, samples obtained in this way are not independent random ones. For this reason, other sampling techniques should be taken into consideration. In the paper, some bankruptcy prediction models for Polish manufacturing firms are discussed. Simple random sampling with replacement is used as the method of company selection. A comparative study of the best models (taking into account their prognostic capabilities) obtained by applying two sampling techniques (i.e. pair-matched sampling and random sampling with replacement) is also presented.

References

ALTMAN, E.I. (1968): Financial Ratios, Discriminant Analysis and the Prediction of Corpo- rate Bankruptcy. The Journal of Finance, 23(4), 589-609. POCIECHA J. (2007): Problemy prognozowania bankructwa firmy metoda¸analizy dyskrymi- nacyjnej (Problems of bankruptcy forecasting using discriminant analysis). Acta Univer- sitatis Lodziensis, Folia Oeconomica, 205, 63-79.

Keywords

CLASSIFICATION MODELS, CORPORATE BANKRUPTCY, BANKRUPTS AND NON-BANKRUPTS SAMPLING

The use of hybrid predictive C&RT-logit models in analytical CRM

Mariusz Łapczyński

Cracow University of Economics, Department of Market Analysis and Marketing Research [email protected]

Abstract. Predictive models in analytical CRM are closely related to the customer's lifecycle. Prediction of a binary dependent variable most commonly concerns areas such as customer acquisition, development (cross- and up-selling) and retention (churn analysis). While building predictive models one usually applies decision trees, logistic regression, support vector machines or ensemble methods, such as different algorithms for boosted decision trees or random forests. Recently one can observe an increasing use of hybrid models in analytical CRM, i.e. those that combine several different analytical tools, e.g. cluster analysis with decision trees, genetic algorithms with neural networks, or decision trees with logistic regression. The purpose of this article is to present the results of hybrid predictive C&RT-logit models based on three datasets relating to analytical CRM. The first model refers to a direct marketing campaign carried out by a Portuguese bank. The second and third models pertain to churn analysis and are based on a dataset obtained from the University of California repository as well as the dataset used in the 2009 KDD Cup.
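A minimal sketch of the hybrid tree-plus-logit idea, under strong simplifying assumptions (one feature, a single fixed tree split, a tiny hand-rolled logistic fit): the tree partitions the data and a logistic model is fitted inside each leaf. This is our own illustration, not the authors' C&RT-logit implementation.

```python
import math

def fit_logit(data, iters=500, lr=0.5):
    """Tiny one-feature logistic regression fitted by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(iters):
        gw = gb = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))

def hybrid_cart_logit(data, split):
    """Hybrid C&RT-logit (sketch): one tree split partitions the data,
    then a logistic model is fitted and used for scoring in each leaf."""
    left = [(x, y) for x, y in data if x <= split]
    right = [(x, y) for x, y in data if x > split]
    fl, fr = fit_logit(left), fit_logit(right)
    return lambda x: fl(x) if x <= split else fr(x)
```

In a real application the tree would be grown by C&RT and each leaf's logistic model would use several predictors.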

References

WEI, M. et al. (2008): A Solution to the Cross-Selling Problem of PAKDD-2007: Ensemble Model of TreeNet and Logistic Regression. International Journal of Data Warehousing and Mining, 4/2, 9–14.
STEINBERG, D. and CARDELL, N.S. (1998): The hybrid CART-logit model in classification and data mining. [Online], Available: http://www.salford-systems.com.
LEE, J.S. and LEE, J.C. (2006): Customer Churn Prediction by Hybrid Model. In: X. Li, O.R. Zaiane, Z. Li (Eds.): ADMA 2006, LNAI 4093. Springer-Verlag, Berlin, 959–966.

Keywords

HYBRID PREDICTIVE MODELS, C&RT-LOGIT, ANALYTICAL CRM

Part IV

Contributed Sessions

9 CON-1A: Machine Learning and Knowledge Discovery I

Wednesday, July 2, 2014: 14:00 - 16:05, West Hall 2

On the Influence of Missing Data Methods on Decision Tree Induction

Kristof Szillat1 and Dieter William Joenssen1

Ilmenau University of Technology, Helmholtzplatz 3, 98693 Ilmenau, Germany [email protected] [email protected]

Abstract. Missing data represents a nearly ubiquitous challenge, not only for the social sciences, but also for knowledge discovery tasks. Thus, the treatment of missing data has found a place in every data mining process model (e.g., CRISP-DM 1.0, the KDD process model). The necessary preprocessing of the raw data leads to different complete data, which clearly depends on the missing data method used. Since this process step occurs before the application of data analysis methods, such as the construction of decision trees, the missing data method necessarily influences the resultant predictions and predictive accuracy. The aim of this paper is to analyze the possible impact of various elimination and imputation procedures on the predictive accuracy of decision trees. To this end, a simulation study is used to demonstrate the influence of the chosen missing data method on the results of the C4.5 decision tree induction algorithm. The study design includes the variation of not only the missing data method, but also other factors, such as the missingness mechanism and the proportion of missing values. The conclusions reached will allow the selection of the missing data method appropriate for the situation considered.
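The preprocessing step under study can be sketched as follows: inject missing values under an MCAR mechanism with a chosen proportion, then either impute or delete before tree induction. The function names and the mean-imputation choice are illustrative assumptions; the paper compares several elimination and imputation procedures.

```python
import random

def inject_mcar(rows, col, prop, seed=4):
    """MCAR missingness injection (sketch): each value in column `col`
    becomes None independently with probability `prop`."""
    rng = random.Random(seed)
    return [[None if j == col and rng.random() < prop else v
             for j, v in enumerate(r)] for r in rows]

def mean_impute(rows, col):
    """Replace missing values in `col` by the mean of the observed values."""
    obs = [r[col] for r in rows if r[col] is not None]
    m = sum(obs) / len(obs)
    return [[m if j == col and v is None else v for j, v in enumerate(r)]
            for r in rows]

def listwise_delete(rows):
    """Elimination alternative: drop every row with a missing value."""
    return [r for r in rows if None not in r]
```

In the simulation study design, a tree learner such as C4.5 would then be trained on each treated dataset and the predictive accuracies compared.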

References

BANKHOFER, U. (1995): Unvollständige Daten- und Distanzmatrizen in der Multivariaten Datenanalyse. Eul, Bergisch Gladbach.
BREIMAN, L., FRIEDMAN, J.H., STONE, C. and OLSHEN, R.A. (1998): Classification and Regression Trees. CRC Press, Boca Raton.
LITTLE, R.J.A. and RUBIN, D.B. (2002): Statistical Analysis with Missing Data. Wiley, Hoboken.
TWALA, B. (2009): An Empirical Comparison of Techniques for Handling Incomplete Data Using Decision Trees. Applied Artificial Intelligence, 23, 373–405.

Keywords

Missing data, C4.5, decision tree, predictive accuracy

Bagging Heterogeneous Decision Trees

Fabian Wolff1 and Dieter William Joenssen1

Ilmenau University of Technology, Helmholtzplatz 3, 98693 Ilmenau, Germany [email protected] [email protected]

Abstract. Over the past decades, much research in data analysis has been devoted to the development of ensemble methods. This research resulted in the development of methods such as bagging and random forests. These ensemble methods improve predictive models in part by introducing heterogeneity into the model. Bagging achieves this heterogeneity by creating multiple decision trees, each constructed from a bootstrap sample of the training set. Another possibility for introducing heterogeneity is to use different decision tree induction algorithms in creating the voting model ensemble. Mixing decision trees induced by different algorithms combines the algorithms' characteristics into a joint model that is more robust against an individual algorithm's shortcomings. To quantify the possible improvements offered by this increased heterogeneity, a simulation study is performed. Using the statistical software R, the performance of heterogeneous bagged ensembles, constructed using various mixtures of C4.5, CHAID, and One-R, is compared using benchmark data from the UCI Machine Learning Repository. An evaluation of different classification quality measures indicates which types of data may profit from heterogeneous bagging.
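The construction can be sketched as follows, with toy threshold learners standing in for C4.5, CHAID and One-R (all names here are illustrative assumptions, and the study itself is carried out in R, not Python):

```python
import random
from collections import Counter

def bagged_heterogeneous(train, learners, n_models=15, seed=2):
    """Bagging over a heterogeneous pool (sketch): each model is induced
    on a bootstrap sample, cycling through different induction algorithms,
    and class predictions are combined by majority vote."""
    rng = random.Random(seed)
    models = []
    for m in range(n_models):
        boot = [rng.choice(train) for _ in train]       # bootstrap sample
        models.append(learners[m % len(learners)](boot))
    def predict(x):
        return Counter(model(x) for model in models).most_common(1)[0][0]
    return predict

def stump(feature):
    """Toy one-feature threshold learner standing in for a tree inducer."""
    def induce(train):
        pos = [x[feature] for x, y in train if y == 1]
        neg = [x[feature] for x, y in train if y == 0]
        # fall back to a zero threshold if the bootstrap misses a class
        thr = 0.0 if not pos or not neg else (
            sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
        return lambda x: 1 if x[feature] > thr else 0
    return induce
```

Replacing the stumps with real C4.5, CHAID and One-R inducers yields the heterogeneous ensembles compared in the study.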

References

BREIMAN, L. (1996): Bagging predictors. Machine Learning, 24, 123–140.
BREIMAN, L., FRIEDMAN, J.H., STONE, C. and OLSHEN, R.A. (1998): Classification and Regression Trees. CRC Press, Boca Raton.
WITTEN, I., FRANK, E. and HALL, M. (2011): Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington.

Keywords

C4.5, CHAID, Decision Trees, Heterogeneity, Predictive Accuracy

Comparison and Statistical Evaluation of Similarity Measures on Concepts in a Lattice

F. Domenach1 and G. Portides1

Computer Science Department, University of Nicosia, 46 Makedonitissas Ave., PO Box 24005, 1700 Nicosia, Cyprus {domenach.f, portides.g}@unic.ac.cy

Abstract. This paper falls within the framework of Formal Concept Analysis, which provides classes (the extents) of objects sharing similar characters (the intents), a description by attributes being associated with each class. In a recent paper by the first author, a new similarity measure between two concepts in a concept lattice was introduced, allowing for a normalization depending on the size of the lattice. In this paper, we compare this similarity measure with existing measures, either based on cardinality of sets or originating from ontology design and based on the graph structure of the lattice. A statistical comparison with the existing methods is carried out, and the output of the measure is tested for consistency.
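As a point of reference, one of the simplest cardinality-based measures looks like this, assuming a concept is represented as an (extent, intent) pair of sets. This is a generic Jaccard-style sketch, not the first author's normalized measure:

```python
def concept_similarity(c1, c2):
    """Cardinality-based concept similarity (sketch): the Jaccard overlap
    of the extents and of the intents, averaged."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0
    (e1, i1), (e2, i2) = c1, c2
    return 0.5 * (jaccard(e1, e2) + jaccard(i1, i2))
```

Graph-based measures from ontology design would instead use path lengths between the two concepts in the lattice's covering relation.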

References

DOMENACH, F. (2014): Similarity Measures of Concept Lattices. To appear in the proceedings of ECDA 2013.
GANTER, B. and WILLE, R. (1999): Formal Concept Analysis: Mathematical Foundations. Springer.

Keywords

FORMAL CONCEPT ANALYSIS, LATTICE, SIMILARITY

Quadtree decomposition for textual information localization in document images

Cynthia Pitou1 and Jean Diatta2

1 LIM-EA2525, University of Reunion Island [email protected] 2 LIM-EA2525, University of Reunion Island [email protected]

Abstract. Text localization [3] is a challenging issue in Information Retrieval [2]. Two main approaches are commonly distinguished: texture-based and region-based ones. In this paper, we propose a region-based method guided by quadtree decomposition [1]. The principle of the method is to decompose document images into four equal regions, each region into four new regions, and so on. Then, with a free OCR engine, we try to extract precise textual information in each region. A region containing a sufficient number of the expected pieces of textual information is not decomposed further. Our method makes it possible to accurately determine, in document images, the regions containing the text information that one wants to locate and retrieve quickly and efficiently. First experiments demonstrate the validity of the proposed method for textual information localization on structured document images and suggest its potential as part of a generic system for automatic document reading. The next step of our work will concern categorization applied to the defined regions.
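The recursive decomposition can be sketched as follows, with the OCR step abstracted into a predicate that reports whether a region contains the sought text. The predicate, the (x, y, w, h) region encoding and the stopping rule are illustrative assumptions, not the authors' exact criterion:

```python
def quadtree_localize(contains_target, region, min_size=1):
    """Quadtree text localization (sketch): split a region into four
    quadrants, keep only quadrants where the OCR predicate still finds
    the sought text, and stop splitting at min_size."""
    x, y, w, h = region
    if not contains_target(region):
        return []
    if w <= min_size or h <= min_size:
        return [region]
    hw, hh = w // 2, h // 2
    quads = [(x, y, hw, hh), (x + hw, y, w - hw, hh),
             (x, y + hh, hw, h - hh), (x + hw, y + hh, w - hw, h - hh)]
    found = []
    for q in quads:
        found += quadtree_localize(contains_target, q, min_size)
    return found
```

In the proposed system the predicate would be implemented by running a free OCR engine on the cropped region and matching the expected text.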

Keywords

TEXT EXTRACTION, QUADTREE DECOMPOSITION, TEXT LOCALIZATION

References

[1] M. Manouvrier, M. Rukoz, and G. Jomier. A generalized metric distance between hierarchically partitioned images. In Proceedings of the 6th International Workshop on Multimedia Data Mining: Mining Integrated Media and Complex Data, MDM '05, pages 33–41. ACM, 2005.
[2] R. J. Mooney and R. Bunescu. Mining knowledge from text using information extraction. SIGKDD Explor. Newsl., 7(1):3–10, June 2005.
[3] C. Wolf, J. Jolion, and F. Chassaing. Text localization, enhancement and binarization in multimedia documents. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 2, pages 1037–1040, 2002.

An Ensemble of Optimal Trees for Class Membership Probability Estimation

Zardad Khan1, Asma Gul1, Osama Mahmoud1, Miftahuddin Miftahuddin1, Aris Perperoglou1, Werner Adler2, and Berthold Lausen1

1 Department of Mathematical Sciences, University of Essex, Colchester, UK. [email protected] 2 Department of Biometry and Epidemiology, University of Erlangen-Nuremberg, Germany

Abstract. Machine learning methods can be used for estimating the class membership probability of an observation. We propose an ensemble of trees that is optimal in terms of predictive performance. This ensemble is formed by selecting the best trees from a large initial set of trees grown by random forest. A proportion of trees is selected on the basis of their individual predictive performance on out-of-bag observations. The selected trees are further assessed for their collective performance on an independent training data set. This is done by adding the trees one by one, starting from the tree with the highest predictive performance. A tree is selected for the final ensemble if it increases the predictive performance of the previously combined trees. The proposed method is compared with probability estimation trees, random forest and node harvest on a number of benchmark problems, using the Brier score as a performance measure. In addition to reducing the number of trees in the ensemble, our method gives better results in most cases. The results are supported by a simulation study.
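The greedy selection step can be sketched as follows, representing each candidate tree by its vector of predicted probabilities on validation data (an illustrative simplification of the procedure described above; the actual method ranks trees on out-of-bag data first):

```python
def select_optimal_ensemble(candidates, y_val):
    """Greedy forward selection (sketch): rank candidate tree predictions
    by individual Brier score, then add trees one at a time starting from
    the best, keeping a tree only if it lowers the averaged ensemble's
    Brier score."""
    def brier(probs):
        return sum((p - y) ** 2 for p, y in zip(probs, y_val)) / len(y_val)
    def average(models):
        return [sum(ps) / len(ps) for ps in zip(*models)]
    ranked = sorted(candidates, key=brier)
    chosen = [ranked[0]]
    for cand in ranked[1:]:
        if brier(average(chosen + [cand])) < brier(average(chosen)):
            chosen.append(cand)
    return chosen
```

Note how an individually weaker but complementary tree can still be accepted, while an uninformative one is rejected, shrinking the ensemble.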

References

KRUPPA, J., ZIEGLER, A. and KÖNIG, I.R. (2012): Risk Estimation and Risk Prediction Using Machine-Learning Methods. Human Genetics, 131, 1639-1654.
MALLEY, J., KRUPPA, J., DASGUPTA, A., MALLEY, K. and ZIEGLER, A. (2012): Probability Machines: Consistent Probability Estimation Using Nonparametric Learning Machines. Methods of Information in Medicine, 51, 74–81.
MEINSHAUSEN, N. (2010): Node Harvest. The Annals of Applied Statistics, 4, 2049-2072.

Keywords

TREE SELECTION, ENSEMBLE METHODS, PROBABILITY ESTIMATION TREES

10 CON-1B: Data Analysis in Finance I

Wednesday, July 2, 2014: 14:00 - 16:05, West Hall 3

Assessing systemic risk with Dynamic Conditional Beta approach

Katarzyna Kuziak

Department of Financial Investments and Risk Management, Wroclaw University of Economics, ul. Komandorska 118/120, 53-345 Wroclaw, Poland [email protected]

Abstract. In this paper systemic risk is understood as the risk of a breakdown or major dysfunction in financial markets. There are four approaches to measuring systemic risk (Hansen 2013). One approach measures codependence in the tails of equity returns of financial institutions (Adrian, Brunnermeier 2011). The next is known as contingent claims analysis (Gray, Jobst 2011). The third consists of network models of the financial system, and the last of dynamic stochastic equilibrium models (e.g. Christiano et al. 2005). In this paper the approach proposed by Robert Engle (2012) is considered. This new method for estimating time series regressions that allow for time variation in the regression coefficients is called Dynamic Conditional Beta, or DCB (Bali et al. 2014). Empirical evidence for the Polish financial system is given.
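As a crude, illustrative contrast to DCB: a rolling-window regression also yields a time-varying beta, but from a moving OLS fit rather than from the conditional covariance of a DCC-GARCH model as in Engle's approach. The function below is our own sketch, not the paper's estimator.

```python
def rolling_beta(x, y, window):
    """Rolling-window market beta (sketch): for each window, the OLS slope
    cov(x, y) / var(x) of asset returns y on market returns x. A simple
    stand-in for time-varying regression coefficients."""
    betas = []
    for t in range(window, len(x) + 1):
        xs, ys = x[t - window:t], y[t - window:t]
        mx, my = sum(xs) / window, sum(ys) / window
        cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / window
        var = sum((a - mx) ** 2 for a in xs) / window
        betas.append(cov / var)
    return betas
```

DCB replaces the arbitrary window by model-based conditional second moments, so the beta path reacts to volatility clustering instead of averaging over it.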

References

ADRIAN, T., BRUNNERMEIER, M.K. (2011): CoVaR. Technical Report, Federal Reserve Bank of New York, Staff Reports no. 348.
BALI, T.G., ENGLE, R.F., TANG, Y. (2014): Dynamic Conditional Beta is Alive and Well in the Cross-Section of Daily Stock Returns, available at http://papers.ssrn.com abstract no 2089636.
CHRISTIANO, L.J., EICHENBAUM, M., EVANS, Ch.L. (2005): Nominal Rigidities and the Dynamic Effects of a Shock to Monetary Policy. Journal of Political Economy, 113 (1), 1–45.
ENGLE, R. (2012): Dynamic Conditional Beta, available at http://papers.ssrn.com abstract no 2084872.
GRAY, D.F., JOBST, A.A. (2011): Modelling Systemic Financial Sector and Sovereign Risk. Sveriges Riksbank Economic Review, 2, 68–106.
HANSEN, L.P. (2013): Challenges in Identifying and Measuring Systemic Risk, available at http://www.nber.org/chapters/c12507.pdf.

Keywords

SYSTEMIC RISK, MES, SRISK, CoVaR, DCB

Power of skewness tests in the presence of fat tailed financial distributions

Krzysztof Piontek

Department of Financial Investments and Risk Management Wroclaw University of Economics, ul. Komandorska 118/120, Wroclaw, Poland [email protected]

Abstract. The presumption of symmetry or asymmetry of financial return distributions is assumed in many financial problems (portfolio selection, risk management, option pricing). Testing skewness is still an open and significant issue. The most widely used test of skewness is the Jarque-Bera approach. However, this test is not reliable in the presence of the leptokurtosis that is observed in financial data. The goal is to investigate the power of several skewness tests when applied to fat-tailed (typical for finance) return distributions. Four approaches to testing the skewness of distributions are discussed: the classical and the adjusted (taking fatter tails into consideration) Jarque-Bera tests, a test based on the Pearson type IV distribution, and the Peiro test, which makes no assumption about the type of distribution. In the empirical part, the power of each test is estimated using Monte Carlo simulations. Different asymmetric and fat-tailed distributions are used for data generation. The frequency of rejecting the null hypothesis of symmetry of the distribution, when it is false, is used as an approximate value of the power of the test. Data series with different numbers of observations and different skewness values are simulated. The last part summarizes the results, compares values obtained using the different test methods and gives hints for risk managers.
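The Monte Carlo power-estimation loop can be sketched as follows, using the simple normal-theory skewness statistic sqrt(n/6)*b1 as the test. This is an illustrative stand-in for the four tests actually compared in the paper:

```python
import math
import random

def skewness(xs):
    """Sample skewness coefficient b1 = m3 / m2^(3/2)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def mc_power(sampler, n=200, reps=500, seed=3):
    """Monte Carlo power estimate (sketch): the frequency with which the
    normal-theory skewness test |sqrt(n/6) * b1| > 1.96 rejects under a
    given data-generating process."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        xs = [sampler(rng) for _ in range(n)]
        if abs(skewness(xs) * math.sqrt(n / 6.0)) > 1.96:
            rejections += 1
    return rejections / reps
```

Running the loop under a skewed generator gives the power; running it under a symmetric generator gives the empirical size, which is how leptokurtic generators expose the classical test's unreliability.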

References

ASAI, M. and DASHZEVEG, U. (2008): Distribution-Free Test for Symmetry with an Applic. to S&P Index Returns. Applied Economics Letters, 15(6), 461–464.
BERA, A., PREMARATNE, G. (2001): Adjusting the Tests for Skewness and Kurtosis for Distributional Misspecifications. UIUC-CBA Research WP No. 01-0116.
BRYS, G., HUBERT, M., STRUYF, A. (2003): A comparison of some new measures of skewness. Developments in Robust Statistics, 98–113.

Keywords

TESTS OF SYMMETRY, RETURN DISTRIBUTIONS, FAT TAILS

A Hidden Markov Model to detect relevance in financial documents based on on/off topics

Dimitrios Kampas1, Christoph Schommer1, and Ulrich Sorger1

University of Luxembourg, Dept. of Computer Science and Communication, {dimitrios.kampas, christoph.schommer, ulrich.sorger}@uni.lu

Abstract. Automated text classification has gained significant attention since a vast amount of documents in digital form is widespread and continuously increasing. Most standard classification approaches posit the independence of the term features in a document, which is unrealistic considering the sophisticated structure of language. Our research concerns the discovery of relevance in documents, where a relevant document refers to a sufficient number of thematic topics that are either 'on' or 'off'. 'On' topics are semantically close to a domain-specific discourse, whereas 'off' topics are not. As a rather promising approach, we have modelled a stochastic process for term sequences, where each term is conditionally dependent on its preceding terms. Hidden Markov Models hereby provide a reliable potential to incorporate language and domain dependencies into a classification. Terms are deterministically associated with classes to improve the probability estimates for infrequent words. In the paper presentation, we demonstrate our approach and motivate its eligibility by the exploration of annotated Thomson Reuters news documents; in particular, the 'on topic' documents discuss the monetary policy of the Federal Reserve. We estimate the transition and emission probabilities of our model on a training set of both on and off topic documents and evaluate the accuracy of our approach using 10-fold cross-validation. This work is part of the interdisciplinary research project ESCAPE, which is funded by the Fonds National de la Recherche. We kindly thank our colleagues from the Dept. of Finance for their support.
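The sequence dependence can be illustrated with a tiny forward-algorithm sketch over a two-state on/off HMM. The states, probabilities and the unseen-term floor below are all assumptions for the demo, not the authors' estimated model:

```python
import math

def log_forward(seq, states, start, trans, emit, floor=1e-6):
    """Forward algorithm (sketch): log-likelihood of a term sequence under
    an HMM with 'on topic' / 'off topic' states, so each term's score
    depends on the state suggested by its preceding terms. Unseen terms
    get a small floor probability; probabilities are kept unlogged
    internally, which is fine for short documents."""
    alpha = {s: start[s] * emit[s].get(seq[0], floor) for s in states}
    for term in seq[1:]:
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states)
                 * emit[s].get(term, floor) for s in states}
    return math.log(sum(alpha.values()))
```

Because the transition matrix is sticky, a document that stays on topic scores much higher than one that mixes on- and off-topic terms, which is exactly the dependency a bag-of-words model cannot express.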

References

[EAM] R. J. Elliott, L. Aggoun, and J. B. Moore. Hidden Markov Models. Applications of Mathematics, Vol. 29. Springer-Verlag, New York, 1995.
[KS] D. Kampas, C. Schommer. A Hybrid Classification System to Find Financial News that is Relevant. Proceedings of the European Conference on Data Analysis (2013).

Keywords

RELEVANCE, HIDDEN MARKOV MODELS, FINANCIAL NEWS

Born print, reborn digital - the Hoppenstedt Data Archive

Irene Schumm, Sebastian Weindel, and Philipp Zumstein

Mannheim University Library, 68131 Mannheim, Germany {irene.schumm, sebastian.weindel, philipp.zumstein}@bib.uni-mannheim.de

Abstract. Access to data sources is a crucial factor for empirical research and will gain further importance in the future. While current as well as historical financial data are easily available for US companies, the same is not the case for German companies. Our goal is to establish a widely accessible database of reliable financial data on companies listed on Germany's stock markets over a long time period. The yearly published "Aktienführer Hoppenstedt" is a commonly used source of financial data on German companies, but older volumes are only available as printed books. Moreover, these books are protected by copyright. We overcome both limitations by transforming the information into a database open for research. The final result of the project is a database in which researchers in Germany can query and export the collected data of stock companies for two decades (1979-1999). In the spirit of open science, research findings based on this data can easily be replicated and validated by other researchers as well. The project is funded by the German Research Foundation (DFG).

References

[1] ARTMANN, S., FINTER, P., KEMPF, A., KOCH, S. and THEISSEN, E. (2012): The Cross-Section of German Stock Returns: New Data and New Evidence. Schmalenbach Business Review, 64, 20–42.
[2] BETZER, A. and THEISSEN, E. (2010): Sooner or Later: Delays in Trade Reporting by Corporate Insiders. Journal of Business Finance and Accounting, 37, 130–147.
[3] THEISSEN, E. and ANDRES, C. (2008): Setting a Fox to Keep the Geese: Does the Comply-or-Explain Principle Work? Journal of Corporate Finance, 14, 289–301.

Keywords

DATA PREPARATION, DIGITIZING INFORMATION, DATA ARCHIVE, STOCK MARKET LISTED COMPANIES, COMPANY DATA, PANEL DATA

Experimental design in evaluation of VaR independence tests

Marta Małecka1

University of Lodz, Department of Statistical Methods, Poland [email protected]

Abstract. Statistical inference about VaR (Value-at-Risk) models can be based on the unconditional or the conditional distribution of VaR failures. The traditional unconditional approach is aimed at checking whether the overall ratio of VaR failures is consistent with the assumed tolerance level. According to the conditional approach, a good VaR model produces independent exceedances, which implies testing the independence property. The paper investigates statistical properties of VaR independence tests through a simulation study. The focus of the study is on experimental design in using the Monte Carlo method. Simulation studies relating to risk management are usually based on GARCH-model experiments. In the paper we propose to extend the analysis using various simulation experiments which reflect the volatility clustering phenomenon and produce serially correlated squared returns. Two simulation experiments proposed in the study are based on the BGAR and BGMA processes, which use the beta-gamma transform and properties of the beta and gamma distributions [Lewis, McKenzie, Hugus 1989]. Moreover, we applied a Markov chain based experiment. The simulation study showed that the results relating to the size and the power of the tests differed over the experiments. GARCH-based experiments were specific in the sense that they gave the lowest power estimates. The test ranking also depended on the chosen experiment model.
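One standard independence test of this kind is Christoffersen's Markov test, which compares a first-order Markov chain fitted to the 0/1 hit sequence against the i.i.d. null. The sketch below is our own illustration of a test commonly evaluated in such studies, not necessarily the exact test set of the paper:

```python
import math

def christoffersen_independence(hits):
    """Christoffersen-style Markov independence test (sketch): LR statistic
    comparing a first-order Markov chain for the VaR hit sequence against
    the i.i.d. alternative; asymptotically chi-square with 1 df, so values
    above 3.84 reject independence at the 5% level."""
    n = [[0, 0], [0, 0]]                     # transition counts n[prev][cur]
    for prev, cur in zip(hits, hits[1:]):
        n[prev][cur] += 1
    n00, n01, n10, n11 = n[0][0], n[0][1], n[1][0], n[1][1]
    p01 = n01 / (n00 + n01) if (n00 + n01) else 0.0   # P(hit | no hit)
    p11 = n11 / (n10 + n11) if (n10 + n11) else 0.0   # P(hit | hit)
    p = (n01 + n11) / (n00 + n01 + n10 + n11)          # pooled hit rate

    def ll(p_, k, m):                        # Bernoulli log-likelihood
        if m == 0:
            return 0.0
        if p_ <= 0.0:
            return 0.0 if k == 0 else float("-inf")
        if p_ >= 1.0:
            return 0.0 if k == m else float("-inf")
        return k * math.log(p_) + (m - k) * math.log(1.0 - p_)

    l_null = ll(p, n01 + n11, n00 + n01 + n10 + n11)
    l_alt = ll(p01, n01, n00 + n01) + ll(p11, n11, n10 + n11)
    return 2.0 * (l_alt - l_null)
```

A Monte Carlo study of the kind described above would feed this statistic with hit sequences generated under GARCH, BGAR/BGMA or Markov chain experiments and record the rejection frequencies.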

References

BERKOWITZ, J., CHRISTOFFERSEN, P. and PELLETIER, D. (2011): Evaluating Value-at-Risk Models with Desk-Level Data. Management Science, 57(12), 2213-2227.
LEWIS, P.A.W., MCKENZIE, E. and HUGUS, D.K. (1989): Gamma processes. Comm. Statist. Stochastic Models, 5, 1–30.

Keywords

VaR TEST, INDEPENDENCE TEST, EXPERIMENTAL DESIGN

11 CON-1C: Statistics and Data Analysis I

Wednesday, July 2, 2014: 14:00 - 16:05, West Hall 4

Prediction of Upper Esophageal Sphincter Restitution Time in Aberrant Swallows

Nicolas Schilling1, Andre Busche2, Simone Miller3, Michael Jungheim3, Martin Ptok3, and Lars Schmidt-Thieme1

1 University of Hildesheim, Information Systems and Machine Learning Lab {schilling,schmidt-thieme}@ismll.uni-hildesheim.de 2 Brunel Communications, Daimlerring 9, Hildesheim {a.busche}@brunel.net 3 Medizinische Hochschule Hannover, Klinik für Phoniatrie und Pädaudiologie {Miller.Simone,Jungheim.Michael,Ptok.Martin}@mh-hannover.de

Abstract. In preliminary work [Schilling, 2013], we established a machine learning model for the calculation of upper esophageal sphincter (UES) restitution time in normal swallows. In this paper, we advance our work to predict restitution times for aberrant swallows of different pH values, namely acid swallows, as UES reactions to acid during swallow or reflux events are not entirely proven [Vardar, 2012]. We analyze acid swallows measured in a controlled environment. These already show a significantly different pattern, justifying both an in-depth analysis and requiring special consideration while modelling the task. The proposed improvement is twofold: First, we enhance the model's performance by choosing from a more general class of models, namely Factorization Machines [Rendle, 2010], to also model categorical features such as the pH value. Second, we enhance hyperparameter choice to speed up the learning process, making it more applicable for real-world scenarios. Preliminary results on a large corpus of over 500 annotated swallows empirically prove the effectiveness of our approach.
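The second-order Factorization Machine model underlying this approach can be evaluated in linear time using Rendle's reformulation of the pairwise interaction term. A small numpy sketch (our own illustration, not the paper's implementation):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order Factorization Machine prediction (Rendle, 2010) in O(k*n):
    y(x) = w0 + <w, x> + 1/2 * sum_f [ (<V[:,f], x>)^2 - <V[:,f]^2, x^2> ].
    V holds one k-dimensional latent vector per feature (one row per feature)."""
    s = V.T @ x                    # <V[:,f], x> for each latent dimension f
    s2 = (V ** 2).T @ (x ** 2)     # <V[:,f]^2, x^2>
    return w0 + w @ x + 0.5 * np.sum(s * s - s2)
```

The reformulation gives exactly the sum of all pairwise interactions weighted by inner products of latent feature vectors, which is what makes FMs attractive for sparse, dummy-coded categorical features such as the pH value.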

References

Schilling, N. et al. (2013): Event Prediction in Pharyngeal High-Resolution Manometry. Proceedings of the European Conference on Data Analysis, ECDA 2013, Springer, Luxembourg.
Vardar, R. et al. (2012): Upper esophageal sphincter and esophageal motility in patients with chronic cough and reflux: assessment by high-resolution manometry. Official Journal of the International Society for Diseases of the Esophagus.
Rendle, S. (2010): Factorization Machines. Proceedings of the IEEE 10th International Conference on Data Mining (ICDM), 2010, pp. 995–1000.

Keywords Sequence Labeling, Factorization Machines, Pharyngeal Manometry

Exploiting longitudinal epidemiological data in similarity-based classification

T. Hielscher1, H. Völzke2, J.-P. Kühn2, M. Spiliopoulou1

1 Otto-von-Guericke University Magdeburg [email protected], [email protected] 2 University Medicine Greifswald {voelzke,kuehn}@uni-greifswald.de

Abstract. Characterizing risk factors for diseases or disorders is one goal in epidemiology [1]. Epidemiological studies consider vast numbers of participant assessments in order to find such factors, which can serve as valuable knowledge for diagnosis. However, their identification is challenging. Risk factors may be important only for small subpopulations, and their association strength with the disorder under study can vary greatly. Additionally, potential knowledge provided by same-assessment sequences obtained in longitudinal studies is often left unexplored. In our contribution, we present a workflow to identify important features and distinct subpopulations from longitudinal epidemiological data for a multifactorial disorder. Based on a workflow where only the latest study recordings are considered [2], we show how past recordings and adjusted participant similarity measures can improve class separation quality. We report the workflow's performance for the disorder "hepatic steatosis" on the Study of Health in Pomerania [3], using the data made available under the cooperation SHIP/2012/06/D Predictors of Steatosis Hepatis.3

References

1. Preim, B., Klemm, P., . . . , Oeltze, S., Toennies, K. and Völzke, H. (2014): Visual Analytics of Image-Centric Cohort Studies in Epidemiology. In: Linsen, L., Hamann, B. and Hege, H.-C. (Eds.): Visualization in Medicine and Life Sciences III. Springer, Berlin, in print.
2. Hielscher, T., Spiliopoulou, M., Völzke, H. and Kühn, J.-P. (2014): Using Participant Similarity for the Classification of Epidemiological Data on Hepatic Steatosis. In: Proc. of 27th IEEE Int. Symp. on Computer-Based Medical Systems (CBMS'14). accepted 03/2014, to appear.
3. Völzke, H., Alte, D., . . . , Biffar, R., John, U. and Hoffmann, W. (2011): Cohort Profile: the Study of Health In Pomerania. Int. J. of Epidemiology, 40, 294–307.

3 Part of this work was supported by the German Research Foundation project SP 572/11-1 "IMPRINT: Incremental Mining for Perennial Objects".


Keywords:

Medical mining, mining longitudinal epidemiological data, patient similarity

Incremental Generalized Canonical Correlation Analysis

Angelos Markos1 and Alfonso Iodice D’Enza2

1 Department of Primary Education, Democritus University of Thrace, Greece [email protected] 2 Department of Economics and Law, Università di Cassino e del Lazio Meridionale, Italy [email protected]

Abstract. Generalized canonical correlation analysis (GCANO) is a versatile technique that allows the joint analysis of several sets of data matrices through data reduction. The method embraces a number of representative techniques of multivariate data analysis as special cases (Takane et al., 2008). When all data sets consist of indicator variables, GCANO specializes into Correspondence Analysis (simple and multiple), and into Principal Component Analysis when each of the data sets consists of a single continuous variable. In the case of two data sets with continuous variables, GCANO reduces to canonical correlation analysis, and when one of the two sets of variables consists of indicator variables, the method specializes into canonical discriminant analysis or MANOVA. GCANO can also be viewed as a method for data fusion from disparate sources and has recently found applications in large scale scenarios. The GCANO solution can be obtained noniteratively through an eigenequation, and distributional assumptions are not required. However, the high computational and memory requirements of ordinary eigendecomposition make its application impractical on massive or sequential data sets. The aim of the present contribution is twofold: i) to extend the family of GCANO techniques to a split-apply-combine framework that leads to exact and parallel implementations; ii) to allow for incremental updates of existing solutions, which lead to approximate yet highly accurate solutions (see Iodice D'Enza and Markos, 2014). For this purpose, an incremental SVD approach with desirable properties is revised and embedded in the context of GCANO, extending its applicability to modern big data problems and data streams.
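The incremental idea can be illustrated with a row-wise SVD update: when the current SVD is exact, the singular values and right singular vectors of the grown data matrix can be recovered from a much smaller matrix. A hedged numpy sketch (illustrative only, not the revised algorithm of Iodice D'Enza and Markos, which also controls truncation error):

```python
import numpy as np

def isvd_update(S, Vt, B, k):
    """Fold a new batch of rows B into an existing (truncated) SVD.

    If X = U diag(S) Vt is the SVD of the data seen so far, then the
    singular values / right singular vectors of the stacked matrix [X; B]
    equal those of the small matrix [diag(S) Vt; B], because both have the
    same Gram matrix X'X + B'B."""
    M = np.vstack([S[:, None] * Vt, B])
    _, S_new, Vt_new = np.linalg.svd(M, full_matrices=False)
    return S_new[:k], Vt_new[:k]
```

With truncation (k smaller than the rank) the update becomes approximate, which is the trade-off the abstract refers to for streaming data.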

References

IODICE D’ ENZA, A. and MARKOS, A. (2014): Low-dimensional tracking of association structures in categorical data. Statistics and Computing (in press). TAKANE, Y., HWANG, H. and ABDI, H. (2008): Regularized multiple-set canonical corre- lation analysis. Psychometrika, 73 (4), 753–775.

Key words: Singular value decomposition, Incremental methods, Dimensionality reduction

Weighted Rank Correlation Measures Based on Fuzzy Order Relations

Sascha Henzgen and Eyke Hüllermeier

Computer Science Institute, University of Paderborn, Germany

Abstract. Rank correlation measures are widely used in a broad range of applications. Although a weighting of the rank positions according to their importance is desirable in many of these applications, work on theoretically well-founded extensions of conventional (unweighted) rank correlation measures is still very limited. Here, we develop a formal framework for designing weighted rank correlation measures based on the notion of fuzzy order relation (Bodenhofer and Demirci, 2008), that is, generalizations of the conventional SMALLER, EQUAL and GREATER relations on the real numbers. Fuzzy order relations allow for expressing that a position n is higher than m to a certain degree, while to some degree these positions are also considered as being equal. On the basis of relations of this kind, fuzzy rank correlation measures have been proposed as generalizations of conventional rank correlation (Dolores Ruiz and Hüllermeier, 2012). The idea of our approach is to use so-called scaling functions (Klawonn, 1994) in order to define a fuzzy equivalence relation EQUAL on the domain of ranks {1,...,N}, and then to apply a fuzzy rank correlation measure on this domain equipped with the fuzzy ordering induced by EQUAL. Roughly speaking, for each position n, the scaling function s(·) defines the degree of importance s(n) of this position, i.e., of not reversing the items on positions n and n + 1. We show that our framework accommodates a number of existing measures as special cases while also suggesting new ones in a quite natural manner. Moreover, we show that all rank correlation measures defined through appropriate scaling functions exhibit desirable mathematical properties.
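One simple instance of position weighting (a crude stand-in for the fuzzy-order construction, with an assumed pair weight w(i, j) = s(i)·s(j)) is a weighted Kendall-type coefficient:

```python
import numpy as np

def weighted_rank_corr(r1, r2, s):
    """Weighted Kendall-type rank correlation in [-1, 1].

    r1, r2 : rank vectors (rank of item i in each ranking).
    s      : importance function over positions; each item pair is weighted
             by the importance of the positions it occupies in r1."""
    n = len(r1)
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w = s(r1[i]) * s(r1[j])
            num += w * np.sign((r1[i] - r1[j]) * (r2[i] - r2[j]))
            den += w
    return num / den
```

With a top-heavy importance function such as s(p) = 1/p, swapping two items at the top of the ranking lowers the coefficient more than the same swap at the bottom, which is exactly the kind of behavior a weighted measure is meant to capture.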

References

BODENHOFER, U. and DEMIRCI, M. (2008): Strict fuzzy orderings with a given context of similarity. Int. J. of Uncertainty, Fuzziness and Knowledge-Based Systems, 16(2):147–178.
KLAWONN, F. (1994): Fuzzy sets and vague environment. Fuzzy Sets and Systems, 66:207–221.
DOLORES RUIZ, M. and HÜLLERMEIER, E. (2012): A formal and empirical analysis of the fuzzy gamma rank correlation coefficient. Inform. Sciences, 206:1–17.

12 CON-1D: Data Analysis in Interdisciplinary Domains (Musicology I)

Wednesday, July 2, 2014: 14:00 - 16:05, West Hall 5

Fast Model Based Optimization of Tone Onset Detection by Instance Sampling

Nadja Bauer, Klaus Friedrichs and Claus Weihs

TU Dortmund, Chair of Computational Statistics {bauer, friedrichs, weihs}@statistik.tu-dortmund.de

Abstract. There exist several algorithms for tone onset detection, but finding the best one is a challenging task, as there are many categorical and numerical parameters to optimize. The target of this task is to detect as many true onsets as possible while avoiding false detections (e.g., f-measure). In recent years, model based optimization (MBO) has been introduced for solving similar problems. After evaluating the points of an initial design – each point represents one possible algorithm configuration – the main idea is a loop of two steps: firstly, updating a surrogate model (e.g., kriging), and secondly, proposing a new promising point for evaluation. While originally this technique has been developed mainly for numerical parameters, here it needs to be adapted for optimizing categorical parameters as well. Hence, the first point of this work is comparing different MBO techniques for optimizing onset detection. Especially, the input of the surrogate model (kriging vs. random forest) and the size of the initial design are investigated. Unfortunately, each optimization step is very time-consuming, since the evaluation of each new point has to be performed on a large data set of music instances to obtain realistic results. Nevertheless, many bad configurations can be rejected much faster, since their expected performance according to a statistical model is very low after evaluating them on just a small partition of instances. Hence, the basic idea is to evaluate each proposed point on a small sample and only evaluate on the whole data set if the results seem promising. Instead of using a random sample, this method is further improved by sampling representative music instances with respect to the prediction of the expected performance.
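The surrogate-model loop can be sketched in a few lines. Below is a minimal numpy-only version with a kriging (Gaussian process) surrogate and a lower-confidence-bound acquisition on a single numeric parameter in [0, 1]; the kernel length-scale, the candidate grid, and the acquisition choice are assumptions, and the categorical and instance-sampling extensions discussed in the abstract are not shown:

```python
import numpy as np

def gp_fit_predict(X, y, Xs, ls=0.2, noise=1e-6):
    """Minimal RBF-kernel GP ('kriging') posterior mean and std on Xs."""
    def k(a, b):
        d = a[:, None] - b[None, :]
        return np.exp(-0.5 * (d / ls) ** 2)
    K = k(X, X) + noise * np.eye(len(X))          # jitter for stability
    Ks = k(Xs, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mu, np.sqrt(np.clip(var, 0.0, None))

def mbo_minimize(f, n_init=4, n_iter=12, seed=0):
    """MBO loop: evaluate initial design, then refit surrogate and propose
    the candidate minimizing a lower confidence bound mu - 2*sd."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, n_init)             # initial design
    y = np.array([f(x) for x in X])
    grid = np.linspace(0.0, 1.0, 201)             # candidate configurations
    for _ in range(n_iter):
        mu, sd = gp_fit_predict(X, y, grid)
        x_next = grid[np.argmin(mu - 2.0 * sd)]   # exploit mean, explore sd
        X = np.append(X, x_next)
        y = np.append(y, f(x_next))
    return X[np.argmin(y)], y.min()
```

In the paper's setting, evaluating `f` would mean running the onset detector with one configuration over (a sample of) the music instances and returning one minus the f-measure.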

Keywords

ONSET DETECTION, MODEL BASED OPTIMIZATION, INSTANCE SAMPLING

Recognition of leitmotives in Richard Wagner's music: chroma distance and listener expertise

Daniel Müllensiefen1, David Baker1, Christophe Rhodes1, Tim Crawford1, and Laurence Dreyfus2

1 Goldsmiths, University of London {d.mullensiefen,ps301db,c.rhodes,t.crawford}@gold.ac.uk 2 University of Oxford [email protected]

Abstract. The leitmotives in Richard Wagner's Der Ring des Nibelungen serve a range of compositional and psychological functions, including the introduction of musical structure and mnemonic devices for the listener. Leitmotives in the Ring differ greatly in their construction, salient aspects (e.g. rhythmic, melodic, harmonic), and their usage in particular scenes and contexts. We aim to understand listeners' real-time processing of leitmotives, and have gathered data from a memory test, probing participants' memory for different leitmotives contained in a 10-minute excerpt from the opera Siegfried. An item response theory (IRT) approach was used to estimate item difficulty parameters as well as parameters characterizing participants' individual recognition ability. We fit a series of IRT models to the data obtained from 68 participants, finding that a Rasch Model with an unconstrained but fixed discrimination parameter fit the data best according to the Bayesian Information Criterion. We further investigated the relationship between model parameters and factors such as: number of leitmotive occurrences in the excerpt; acoustical distance using chroma features (Mauch & Dixon, 2010) and distance thresholding (Casey, Rhodes & Slaney, 2008); extent of musical training; and objective and self-reported Wagner expertise. Performance in the objective Wagner test and chroma distance were statistically significant predictors, while number of occurrences, self-reported Wagner expertise and extent of musical training did not reach significance.

References

MAUCH, M. and DIXON, S. (2010): Approximate Note Transcription for the Improved Identification of Difficult Chords. In: Proc. International Society for Music Information Retrieval Conference, Utrecht, Netherlands, 135–140.
CASEY, M., RHODES, C. and SLANEY, M. (2008): Analysis of Minimum Distances in High-Dimensional Musical Spaces. IEEE Transactions on Audio, Speech and Language Processing, 16:5, 1015–1028.

Keywords

MUSIC, MEMORY, ITEM RESPONSE THEORY, LEITMOTIVES
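For readers unfamiliar with the Rasch model used in the abstract above, its item response function, and the way item difficulty relates to observed recognition rates, can be sketched with a toy simulation (our illustration, not the study's data or fitting procedure):

```python
import numpy as np

def rasch_p(theta, b):
    """Rasch (1PL) item response function: probability that a listener with
    recognition ability theta recognises a leitmotive of difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Toy check: simulate many responses of listeners with ability 0 to an item
# of difficulty 1, then invert the logistic to recover the difficulty.
rng = np.random.default_rng(0)
responses = rng.random(20000) < rasch_p(0.0, 1.0)
p_hat = responses.mean()
b_hat = np.log((1 - p_hat) / p_hat)   # inverse logit at theta = 0
```

Real IRT fitting estimates abilities and difficulties jointly by maximum likelihood; the inversion above works only because the ability is fixed and known in this toy setup.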

The Surprising Character of Music. A Search for Sparsity in Music Evoked Body Movements.

Denis Amelynck1, Pieter-Jan Maes2, Marc Leman3, and Jean-Pierre Martens4

1 Ghent University, Belgium [email protected] 2 Ghent University, Belgium [email protected] 3 Ghent University, Belgium [email protected] 4 Ghent University, Belgium [email protected]

Abstract. The high dimensionality of music evoked movement data makes it difficult to uncover the fundamental aspects of human music-movement associations. However, modeling these data via Dirichlet Process Mixture (DPM) Models facilitates this task considerably. In this manuscript we present DPM models to investigate positional and directional aspects of music evoked bodily movement. In an experimental study, subjects moved spontaneously to a musical piece that was characterized by passages of extreme contrasts in physical acoustic energy. The contrasts in acoustic energy caused surprise and triggered new gestural behavior. We used sparsity as a key indicator for surprise and rendered it visible in two ways. Firstly, we analyzed positional data using a Dirichlet Process Gaussian Mixture Model (DPGMM). The model subdivides the positional data into a number of Gaussian clusters. To link with sparsity we must understand what is common (large clusters) and what is special (small clusters). For example, small cluster analysis localized changes in the harmonic structure of the music where a major chord (happy) changed into a minor chord (sad). Large cluster analysis revealed that (for the present music) hand movement mainly happened on two dimensional manifolds tangent to the surface of an ellipsoid. Any movement violating this "rule" can be classified as sparse and is a possible indicator of surprise. Secondly, we analyzed directional data using a Dirichlet Process Multinomial Mixture Model (DPMMM). We defined directional data as a mixture of, inter alia, left/right and up/down movement in a fixed time interval. The model unveiled a dominant directional mix for the low energetic acoustic parts but random directional behavior in the high energetic acoustic parts. Eventually the results from all subjects were consolidated in a directogram.
This is a diagram revealing the "directional movement" characteristics of an entire musical excerpt: persistency, along the diagonal, answers questions like how long subjects move similarly (in terms of direction); consistency, off-diagonal, indicates whether subjects are moving similarly in non-adjacent time intervals. As such the diagram helps to uncover the surprising or unpredictable character of the musical excerpt.
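The cluster-size behavior that makes sparsity detectable (a few large clusters plus a tail of small ones) comes from the Dirichlet Process prior itself. A minimal sketch of its Chinese restaurant process representation (illustrative only; the DPGMM/DPMMM inference used in the study is not shown):

```python
import numpy as np

def crp(n, alpha, seed=0):
    """Draw a random partition of n points from the Chinese restaurant
    process, the clustering prior underlying DPM models.

    Point i joins an existing cluster with probability proportional to its
    current size ('rich get richer'), or opens a new cluster with
    probability proportional to the concentration parameter alpha."""
    rng = np.random.default_rng(seed)
    assign = [0]
    counts = [1]
    for _ in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)      # open a new cluster
        else:
            counts[k] += 1        # join an existing cluster
        assign.append(k)
    return np.array(assign), np.array(counts)
```

The rich-get-richer dynamic is what produces the mix of large "common movement" clusters and small "surprising movement" clusters exploited in the abstract.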

References

HURON, D. (2006): Sweet anticipation: Music and the psychology of expectation. MIT Press, Cambridge, Mass.
LEMAN, M. (2008): Embodied music cognition and mediation technology. The MIT Press, Cambridge, Mass.
TEH, Y.W. (2010): Encyclopedia of Machine Learning: Dirichlet Process. Springer, Germany, 280-287.

Keywords

MUSIC, EMBODIMENT, MACHINE LEARNING, DIRICHLET PROCESS MIXTURE

An iterative learning approach to dataset demarcation in music analysis

Dan Tidhar, Srikanth Cherla, Daniel Wolff, and Tillman Weyde

City University London {dan.tidhar.1, srikanth.cherla.1, daniel.wolff.1, t.e.weyde}@city.ac.uk

Abstract. Music Information Retrieval methods have improved in recent years to the extent that enables their application to large audio collections with reliable results for musicological research [1]. However, the problem of determining which data items in a large collection are relevant to a certain analysis is difficult when the metadata are partial or inconsistent. For example, a theory referring to piano solo performance could benefit from a large-scale analysis of piano solo recordings, but the inclusion of other recordings in the analysis (e.g. with other instruments, or not including piano) will damage its validity. Fully automatic dataset demarcation would require either a reliable audio-based instrumentation recognition algorithm or high quality metadata, and both are often not available. In this paper we describe an iterative approach as a quick and practical solution to the dataset demarcation problem with a limited amount of human annotation. We use a mixture of features, extracted from textual metadata and audio, to predict semantically meaningful labels for audio data. We present the method with the example of the CHARM dataset http://www.charm.rhul.ac.uk/ which consists of 4873 audio recordings with partial and sometimes inconsistent metadata. Initially, a small number of data points is classified by a human annotator. A neural network is trained on this subset to predict a probability distribution over the labels given the input features over the remaining collection. We use entropy as a measure of the uncertainty of each of the network's predictions. The most uncertain data items are presented next to the annotator, with the audio automatically available through a web interface. The annotator then manually classifies the low-certainty data points and the neural network is re-trained until a desired accuracy is reached on held-out labelled data.
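The entropy-based selection step of this active-learning loop can be written compactly. A small numpy sketch (helper names are our own, not the authors' code):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of each row of a matrix of predicted label
    distributions (one row per data item)."""
    p = np.clip(p, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def most_uncertain(probs, k):
    """Indices of the k items whose predicted label distributions are most
    uncertain (highest entropy); these are shown to the annotator next."""
    return np.argsort(entropy(probs))[::-1][:k]
```

Each round, the classifier's predicted distributions over the unlabeled items are scored this way, the top-k items are labeled by the annotator, and the network is re-trained.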

References

[1] WIERING, F. and BENETOS, E. (2013): Digital musicology and MIR: papers, projects, and challenges, ISMIR 2013 Late-breaking session.

Keywords

Data Analysis, Machine Learning, Autotagging, Online Learning

13 CON-1E: Data Analysis in Marketing I

Wednesday, July 2, 2014: 14:00 - 16:05, West Hall 6

Should finite mixture conjoint choice models account for utility interdependencies?

Friederike Paetz1 and Winfried J. Steiner2

1 Department of Marketing, Clausthal University of Technology, 38678 Clausthal-Zellerfeld [email protected] 2 Department of Marketing, Clausthal University of Technology, 38678 Clausthal-Zellerfeld [email protected]

Abstract. Estimation and analysis of segment-specific consumer preferences using choice-based conjoint analysis is nowadays well-established. While the popular Finite Mixture Multinomial Logit (FM-MNL) model assumes independent error terms and suffers from the IIA property, the Finite Mixture Multinomial Probit (FM-MNP) model is known to relax those assumptions. Using a simulation study, we compare the FM-MNP model to its nested version, the Finite Mixture Independent Probit (FM-IP) model. The FM-IP model results from constraining the covariance matrix to the identity matrix. Model performance is assessed in terms of fit, parameter recovery and forecasting accuracy. While our results indicate a better performance of the FM-MNP model concerning (unpenalized) model fit and parameter recovery, only minor differences between the models are found regarding forecasting accuracy. Furthermore, we analyzed the influence of several experimental factors (covariance structure, number and separation of segments, relative segment masses) on the performance measures used. Some of our results are in line with findings for the FM-MNL model obtained in previous simulation studies (cf. Andrews et al. 2002).
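The IIA property that motivates the probit alternative is easy to demonstrate: under multinomial logit, the odds between two alternatives are unchanged when a third alternative is removed from the choice set. A short numpy illustration (not the FM-MNP/FM-IP estimation code):

```python
import numpy as np

def mnl_probs(v):
    """Multinomial logit choice probabilities for one choice set with
    deterministic utilities v (numerically stabilised softmax)."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

# IIA: the odds between alternatives 1 and 2 are the same whether or not
# alternative 3 is available.
p_full = mnl_probs(np.array([1.0, 2.0, 3.0]))
p_reduced = mnl_probs(np.array([1.0, 2.0]))
```

Probit models with a non-diagonal error covariance break this invariance, allowing more realistic substitution patterns between similar alternatives.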

References

ANDREWS, R.L., AINSLIE, A. and CURRIM, I. (2002): An Empirical Comparison of Logit Choice Models with Discrete versus Continuous Representations of Heterogeneity. Journal of Marketing Research, 39 (4), 479–487.
HAAIJER, R., WEDEL, M., VRIENS, M. and WANSBEEK, T.J. (1998): Utility Covariances and Context Effects in Conjoint MNP Models. Marketing Science, 17 (3), 236–252.
VRIENS, M., WEDEL, M. and WILMS, T. (1996): Metric Conjoint Segmentation Methods: A Monte Carlo Comparison. Journal of Marketing Research, 33 (1), 73–85.

Keywords

CONJOINT ANALYSIS, CHOICE MODELS, PROBIT MODELS, UTILITY DEPENDENCIES

Quality evaluation of microeconometric models used in consumer preferences analysis

Tomasz Bartłomowicz1 and Andrzej Bąk2

1 Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected] 2 Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]

Abstract. Studies of consumer preferences that use discrete choice methods typically apply the following models for unordered outcomes: MNLM – MultiNomial Logit Model, CLM – Conditional Logit Model, MLM – Mixed Logit Model and LCM – Latent Class Models. Because different model quality criteria often favour different models, the choice of model is not simple. The main aim of the paper is to present criteria for the quality evaluation of microeconometric models, such as the information criteria AIC – Akaike Information Criterion and BIC – Bayesian Information Criterion, McFadden's coefficient of determination (McFadden R2), McFadden's adjusted coefficient of determination (adjusted McFadden R2), and other criteria related to the characteristics of the model (number of variables, number of classes). Comparison of the criteria across various models (estimated on the basis of empirical data and simulation) should allow the selection of an optimal strategy for assessing the quality of microeconometric models used in studies of consumer preferences. The calculations use R packages, including the authors' DiscreteChoice R package for the measurement of consumer preferences.
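The criteria named above are simple functions of the maximized log-likelihood. Although the paper works in R, the definitions can be sketched compactly (standard formulas, not the DiscreteChoice package code):

```python
import numpy as np

def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2*logL (smaller is better)."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian Information Criterion: k*ln(n) - 2*logL; penalises
    parameters more heavily than AIC for n > 7 or so."""
    return k * np.log(n) - 2 * loglik

def mcfadden_r2(loglik_model, loglik_null):
    """McFadden's pseudo-R2: 1 - logL(model) / logL(null model)."""
    return 1.0 - loglik_model / loglik_null
```

Because BIC penalises model size more strongly than AIC, the two criteria can rank the same set of logit/probit/latent-class models differently, which is precisely the model-selection difficulty the abstract addresses.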

References

Bąk, A. (2013): Mikroekonometryczne metody badania preferencji konsumentów z wykorzystaniem programu R. Wydawnictwo C.H. Beck, Warszawa.
Bąk, A., Bartłomowicz, T. (2013a): Discrete choice multinomial models package DiscreteChoice. http://keii.ue.wroc.pl/DiscreteChoice.
Bąk, A., Bartłomowicz, T. (2013b): Mikroekonometryczne modele wielomianowe i ich zastosowanie w analizie preferencji z wykorzystaniem programu R. Prace Naukowe UE we Wrocławiu nr 278, p. 169-179.
Cameron, A.C., Trivedi, P.K. (2009): Microeconometrics. Methods and Applications. Cambridge University Press, New York.
Gagne, P., Dayton, C.M. (2002): Best Regression Model Using Information Criteria. Journal of Modern Applied Statistical Methods, Vol. 1, No. 2, p. 479-488.
Piłatowska, M. (2011): Information and Prediction Criteria in Selecting the Forecasting Models. Dynamic Econometric Models, Vol. 11, p. 21-38.

Keywords

CRITERIA OF MICROECONOMETRIC MODELS SELECTION, STATED PREFERENCES, R PROGRAM

Wine consumer preference analysis with application of conjoint package of R

Aneta Rybicka1 and Marcin Pełka1

Wrocław University of Economics, Department of Regional Economics, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected]

Abstract. Conjoint analysis is a statistical technique used in market research to determine how people value the different features that make up an individual product or service. The method presents a set of different profiles of goods or services (real or hypothetical), described by attributes, to the respondents. On the basis of respondents' preferences, a decomposition approach is conducted to extract the share of each attribute in the total utility of a profile. The aim of the paper is to present an application of the conjoint package of the R software to evaluate wine consumer preferences in Poland. The conjoint package allows one to prepare and evaluate conjoint studies. In particular, it allows one to evaluate part-worth utilities, each attribute's importance, and the participation (market share) of simulation profiles.
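In ratings-based conjoint analysis, the decomposition step amounts to least squares on dummy-coded profiles. A minimal sketch with synthetic data (our illustration, not the wine study; the R conjoint package performs the analogous computation):

```python
import numpy as np

# Four profiles over two binary attributes, dummy-coded.
# Columns: intercept, attribute A at level 2, attribute B at level 2.
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
true_partworths = np.array([5.0, 2.0, -1.0])
y = X @ true_partworths                     # noise-free respondent ratings

# Part-worth utilities by ordinary least squares.
est, *_ = np.linalg.lstsq(X, y, rcond=None)

# Relative importance of an attribute = its part-worth range / total range.
ranges = np.array([abs(est[1]), abs(est[2])])
importance = ranges / ranges.sum()
```

With real data the ratings are noisy and there are multiple respondents, but the per-respondent estimation step is this same regression.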

References

BĄK, A., BARTŁOMOWICZ, T. (2013): The conjoint package. [URL:] www.r-project.org
BĄK, A. (2013): Mikroekonometryczne metody badania preferencji konsumentów z wykorzystaniem programu R. [Microeconometric methods of consumer preferences analysis with application of R software]. C.H. Beck, Warszawa.
GUSTAFSSON, A., HERRMANN, A. (Eds.) (2000): Conjoint measurement: methods and applications. Springer-Verlag, Berlin.
EVERITT, B.S. (2004): An R and S-PLUS Companion to Multivariate Analysis. Springer-Verlag, London.
GREEN, P.E., SRINIVASAN, V. (1978): Conjoint Analysis in Consumer Research: Issues and Outlook. Journal of Consumer Research, September, 5:103-123.

Keywords

CONJOINT ANALYSIS, R SOFTWARE, PREFERENCE ANALYSIS

Casting the Net: Category Spillover Effects in Crowdfunding Platforms

Dieter William Joenssen1 and Thomas Müllerleile1

Ilmenau University of Technology, Helmholtzplatz 3, 98693 Ilmenau, Germany [email protected] [email protected]

Abstract. Crowdfunding is a process where commercial or non-commercial projects are initiated in a public announcement by organizations or individuals to receive funding, assess the market potential, and build customer relationships. Pledgers may then contribute individual amounts of monetary or non-monetary resources, during a specified time-frame, using offline or online campaign platforms that utilize different payout schemes, in exchange for a product specific or unspecific, material or immaterial reward. Recently, special purpose platforms have emerged in categories such as music. These niche solutions may better cater to the needs inherent to certain project types, but will undoubtedly forgo possible inter-category spillover effects, which are offered by a general purpose platform. To determine whether spillover effects exist, data on 42,996 crowdfunding campaigns with 434,728 distinct pledgers are collected in 24 categories from the current European market leader indiegogo.com. The resulting data are analyzed using graph mining to determine which project categories offer this advantage for the general purpose solution. Insights and recommendations, useful especially for niche platforms seeking to enter the market, are developed.
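A natural graph representation for such a spillover analysis is a weighted co-pledging graph between categories, projected from the bipartite pledger-category relation. A minimal sketch (our own construction with hypothetical field names, not the paper's graph mining pipeline):

```python
from collections import defaultdict
from itertools import combinations

def category_spillover(pledges):
    """Build a weighted co-pledging graph between project categories.

    pledges : iterable of (pledger_id, category) pairs.
    Returns edge weights: (c1, c2) -> number of distinct pledgers who
    backed projects in both categories."""
    cats_by_pledger = defaultdict(set)
    for pledger, cat in pledges:
        cats_by_pledger[pledger].add(cat)
    edges = defaultdict(int)
    for cats in cats_by_pledger.values():
        for c1, c2 in combinations(sorted(cats), 2):
            edges[(c1, c2)] += 1
    return dict(edges)
```

Heavy edges between a niche category and others would indicate spillover potential that a single-category platform forgoes.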

References

ORDANINI, A., MICELI, L., PIZZETTI, M. and PARASURAMAN, A. (2011): Crowdfunding: Transforming Customers into Investors through Innovative Service Platforms. Journal of Service Management, 4, 443-470.
MÜLLERLEILE, T. and JOENSSEN, D.W. (2014): Key Success-Determinants of Crowdfunded Projects: An Exploratory Analysis. In: H.H. Bock, W. Gaul, M. Vichi and C. Weihs (Eds.): Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin. (accepted)

Keywords

SPILLOVER EFFECTS, GRAPH MINING, ENTREPRENEURSHIP

14 CON-1F: Machine Learning and Knowledge Discovery II

Wednesday, July 2, 2014: 14:00 - 16:05, West Hall 8

Profit Measure using Objective Oriented Utility, Causality and Domain-Knowledge.

Otilia Alejandro1 and Sylvie Ratté2

1 École de Technologie Supérieure, ETS. 1100, rue Notre-Dame Ouest, Montreal [email protected] 2 École de Technologie Supérieure, ETS. 1100, rue Notre-Dame Ouest, Montreal [email protected]

Abstract. Most enterprises are concerned with profit. Techniques to identify knowledge affecting profit have generated new research areas like "profit mining" and "action feature utility". In this research, we propose a profit measure based on three elements: the Objective Oriented Utility for association rules (OOA), the causality interpretation, and day-to-day expert knowledge; our global objective is to determine the cost of the rules' attributes. With this new measure, users can distinguish which rules generate more profit, based on technical and domain knowledge. The measure retrieves rules with high immediate profit, rules with long term profit and rules with no profit.

References

WANG, K., ZHOU, S. and HAN, J. (2002): Profit Mining: From Patterns to Actions. In: C.S. Jensen, S. Šaltenis, K.G. Jeffery, J. Pokorný, E. Bertino, K. Böhm and M. Jarke (Eds.): Advances in Database Technology - EDBT 2002. Springer, Berlin Heidelberg, 70-87.
KLEINBERG, J., PAPADIMITRIOU, C. and RAGHAVAN, P. (1998): A Microeconomic View of Data Mining. Journal Data Mining and Knowledge Discovery.
JIANG, Y., WANG, K., TUZHILIN, A. and FU, A.W.-C. (2005): Mining Patterns That Respond to Actions. In: Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM 2005).

Keywords

PROFIT MINING, CAUSALITY, OOA, UTILITY, DOMAIN KNOWLEDGE

Risk Analysis of User Behavior in Online Communities Towards Churn

Philippa A. Hiscock1, Jonathan J. Forster2, Athanassios N. Avramidis3, and Jörg Fliege4

1 University of Southampton, SO17 1BJ, UK [email protected] 2 University of Southampton, SO17 1BJ, UK [email protected] 3 University of Southampton, SO17 1BJ, UK [email protected] 4 University of Southampton, SO17 1BJ, UK [email protected]

Abstract. Businesses supporting online community platforms enable users such as their employees and/or customers to engage with like-minded others. Consequently, customer questions and issues are solved, ideas for product development progress, and the platform delivers value to the business. Risk analysis of the online community platform can prevent a decrease in community activity and hence value. Churn, a decrease in user activity, is an indicator of decrease in community value, as it implies that a user no longer values the support provided by the platform. User churn is expressed as a binary event: either a user decreases in activity (churn), or they continue to behave in a similar manner (do not churn). We present empirical results for risk analysis of the SAP Community Network, http://scn.sap.com/, exploring the interaction between various user features and the likelihood of a user churning.
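A standard way to model such a binary churn event from user features is logistic regression. A self-contained numpy sketch (illustrative only, not the risk model actually fitted to the SCN data; the single "activity" feature is an assumption):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Logistic regression by batch gradient descent: models the churn
    probability P(churn=1 | user features X)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted churn probability
        g = p - y                                # gradient of log-loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b
```

The fitted coefficients quantify how each user feature shifts the churn odds, which is the kind of feature/likelihood interaction the abstract investigates.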

References

HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2011): The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition). Springer.
ISO 31000 (2009): Risk Management – Principles and Guidelines. Technical Report, International Standards Organisation, Geneva, 1–24.
PREECE, J. (2001): Sociability and Usability in Online Communities: Determining and Measuring Success. Behaviour & Information Technology, 20(5), 347–356.
ROWE, M., FERNANDEZ, M., ANGELETOU, S. and ALANI, H. (2013): Community Analysis Through Semantic Rules and Role Composition Derivation. Web Semantics: Science, Services and Agents on the World Wide Web, 18(1), 31–47.

Keywords

ONLINE COMMUNITIES, CHURN, RISK ANALYSIS

Mining Fuel-Inefficient Driving Behaviors From GPS Trajectories

Josif Grabocka and Lars Schmidt-Thieme

ISMLL, University of Hildesheim, Germany {josif, schmidt-thieme}@ismll.uni-hildesheim.de

Abstract. The rising price of fuel has increased attempts to reduce transportation costs by adopting driving behaviors that minimize fuel consumption. Eco-Driving refers to the adaptation of drivers' behaviors with the ultimate aim of minimizing both fuel consumption and GHG emissions. Considering the financial and environmental impacts, researchers have approached this optimization problem from a data analysis perspective [SABOOHI, 2009]. Even though a series of methods has been developed to identify inefficient driving behaviors, they are mostly based on heuristics triggered by instantaneous behavioral data [KAMAL, 2010]. In contrast to existing work, this paper proposes a novel perspective on analyzing driving behaviors by mining solely GPS measurements, which are cheap and easy to acquire. Local driving patterns are identified from the velocity time series of GPS recordings and are stored in the form of histograms of local polynomials. The most influential patterns with respect to fuel consumption are found by applying a series of greedy forward-search regressions calibrated against fuel estimations. When used as regression predictors, the frequencies of influential patterns reduce the error in predicting fuel consumption. Experimental results over real-life GPS data demonstrate that patterns including sudden acceleration and deceleration are the most influential driving behaviors with respect to fuel economy.
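The "histograms of local polynomials" idea can be sketched by fitting a degree-1 polynomial (slope) in sliding windows of the velocity series and counting quantized slopes. The window length and bin edges below are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def pattern_histogram(velocity, window=5,
                      bins=(-np.inf, -1.0, -0.2, 0.2, 1.0, np.inf)):
    """Normalized histogram of local driving patterns from a velocity series.

    Fits a degree-1 polynomial in each sliding window and quantizes the
    slope into bins: strong deceleration, mild deceleration, steady speed,
    mild acceleration, strong acceleration."""
    t = np.arange(window)
    slopes = [np.polyfit(t, velocity[i:i + window], 1)[0]
              for i in range(len(velocity) - window + 1)]
    hist, _ = np.histogram(slopes, bins=bins)
    return hist / hist.sum()
```

Each trip then becomes a fixed-length feature vector of pattern frequencies, suitable as predictors in the forward-search regressions against fuel estimates.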

References

KAMAL, M., MUKAI, M., MURATA, J. and KAWABE, T. (2010): Ecological driver assistance system using model-based anticipation of vehicle-road-traffic information. IET Intelligent Transport Systems, Vol. 4, No. 4, 2010, pp. 244-251.
SABOOHI, Y., FARZANEH, H. (2009): Model for developing an eco-driving strategy of a passenger vehicle based on the least fuel consumption. Applied Energy, Vol. 86, No. 10, 2009, pp. 1925-1932.

Keywords

Eco-Driving, GPS Trajectories, Inefficient Driving Behaviors

78 A Signature Based Method for Fraud Detection on E-Commerce Scenarios

Orlando Belo1, Gabriel Mota1, and Joana Fernandes2

1 University of Minho, Portugal, [email protected];[email protected] 2 Farfetch, Portugal, [email protected]

Abstract. Electronic transactions (e-commerce) have revolutionized the way consumers shop, making small and local retailers, which were being affected by the worldwide crisis, accessible to the entire world. As the e-commerce market expands, the number of commercial transactions supported by credit cards - Card or Customer Not Present (CNP) - also increases. This growing relationship, quite natural and expected, has clear advantages, facilitating e-commerce transactions and attracting new possibilities for trading. However, at the same time a big and serious problem emerges: the occurrence of fraudulent situations in payments. Fraud imposes severe financial losses, which deeply impact e-commerce companies and their revenue. They spend a lot of effort (and money) trying to establish the most satisfactory solutions to detect and counteract the occurrence of a fraud scenario in a timely manner, in order to minimize losses. In the e-commerce domain, fraud analysts are typically interested in subject-oriented customer data, frequently extracted from each order process that occurred on the e-commerce site. Besides transactional data, all behavioural data, e.g. clickstream data, are traced and recorded, enriching the means of detection with profiling data and providing a way to trace customer behaviour over time. In this work, we used a signature-based method to establish the characteristics of user behaviour and detect potential fraud cases. Signatures have already been used successfully for anomaly detection in many areas such as credit card usage, network intrusion, and in particular telecommunications fraud. A signature is defined by a set of attributes that receive a diverse range of variables - e.g. the average number of orders, time spent per order, number of payment attempts, number of days since last visit, and many others - related to the behaviour of a user in an e-commerce application scenario.
Based on the analysis of user behaviour deviation, detected by comparing the user's recent activity with the user's behaviour data, which is expressed through the user signature, we can detect potential fraud situations (deviant behaviours) in useful time, giving fraud analysts a more robust and accurate decision support system for their daily job.
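The signature idea can be sketched as follows: a signature stores per-attribute means and standard deviations over a user's history, and a new session is scored by its average absolute z-score against that signature. The attribute names, the aggregation rule and the data are illustrative assumptions, not the authors' exact formulation.

```python
import math

# Sketch: a behavioural signature and a deviation score for a new session.

def build_signature(history):
    """Signature = per-attribute (mean, std) over a user's past sessions."""
    sig = {}
    for key in history[0]:
        vals = [session[key] for session in history]
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        sig[key] = (mean, std if std > 0 else 1.0)
    return sig

def deviation_score(signature, session):
    """Mean absolute z-score of the session against the stored signature."""
    zs = [abs(session[k] - mean) / std for k, (mean, std) in signature.items()]
    return sum(zs) / len(zs)

history = [{"orders": 2, "payment_attempts": 1},
           {"orders": 3, "payment_attempts": 1},
           {"orders": 2, "payment_attempts": 2}]
signature = build_signature(history)
suspicious = deviation_score(signature, {"orders": 9, "payment_attempts": 8})
normal = deviation_score(signature, {"orders": 2, "payment_attempts": 1})
```

A session whose score exceeds some calibrated threshold would be flagged for the analyst.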

Keywords

E-Commerce, Fraud Detection and Prevention, Clickstream Processing, Signature-Based Methods, Usage Profiling over E-Commerce Systems, Fraud Detection Applications.

79 Experimental design for estimation of the recovery of deep-water megafaunal assemblages from hydrocarbon drilling disturbance in the Faroe-Shetland Channel

Jones, Daniel1, Baeshen, Marwa2, Miftahuddin, Miftahuddin2, Poupakis, Stavros2, and Lausen, Berthold2

1 National Oceanography Centre University of Southampton, United Kingdom 2 Department of Mathematical Sciences, University of Essex, United Kingdom [email protected]

Abstract. Recovery of megabenthic assemblages from physical disturbance at the Laggan deep-water hydrocarbon drilling site in the Faroe-Shetland Channel was assessed using remotely operated vehicle quantitative video survey (Jones et al. 2012). Twelve undisturbed control sites and 2 well sites (A and C, disturbed 3 and 10 yr prior to this work, respectively) were analysed. The megabenthic epifauna at Laggan was dominated by sponges (69.6% of total fauna) represented by 20 taxa. Cnidarians (12.8%; 9 taxa) and echinoderms (7.1%; 11 taxa) were also common. After 3 and 10 yr, densities of motile organisms were less variable with distance, except very close to drilling where densities and richness were still reduced. Sessile faunal densities and richness increased significantly with increasing distance from drilling in all years, although both metrics were significantly higher close to drilling after 3 and 10 yr when compared to immediately after drilling. Using generalised additive models for location, scale and shape (GAMLSS) we suggest a research strategy to establish an experimental design that allows densities of organisms to be estimated with a given precision.

References

JONES, D.O.B., GATES, A.R., LAUSEN, B. (2012): Recovery of deep-water megafaunal assemblages from hydrocarbon drilling disturbance in the Faroe-Shetland Channel. Marine Ecology Progress Series, 461, 71–82.

Keywords

GENERALISED ADDITIVE MODELS FOR LOCATION, SCALE AND SHAPE (GAMLSS)

15 CON-2A: Machine Learning and Knowledge Discovery III

Thursday, July 3, 2014: 10:30 - 12:35, West Hall 2

Feature selection for additive kernel classifiers

Surette Bierman, Nelmarie Louw and Sarel Steel

Stellenbosch University, South Africa [email protected]

Abstract. Kernel classifiers such as support vector machines, kernel Fisher discriminant analysis and kernel logistic regression are known to yield good classification results in a wide array of application domains. In order to fit a kernel classifier, a kernel function needs to be specified. Depending on run-time constraints and the expected form of the true decision boundary, linear or non-linear kernel functions may be used. Generally the use of non-linear kernels leads to superior classification performance, whereas linear kernel classifiers are popular for real-time applications since they are faster to implement. Additive kernels are more general than linear kernels, typically yielding better classification accuracy, yet they do not suffer from the run-time complexity of non-linear kernels. We present a method for feature selection in the context of additive kernel classifiers. The performance of the proposed selection procedure is compared to that of other selection strategies in the literature. Important properties of the technique, together with avenues for further improvements and extensions, are discussed.
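A concrete example of an additive kernel is the histogram-intersection kernel of Maji et al. (2013), K(x, z) = sum_j min(x_j, z_j): every feature contributes an independent one-dimensional term, which is exactly what makes the kernel additive. The toy histograms below are illustrative.

```python
# Sketch: the histogram-intersection kernel, an additive kernel.

def intersection_kernel(x, z):
    """K(x, z) = sum_j min(x_j, z_j): one independent term per feature."""
    return sum(min(a, b) for a, b in zip(x, z))

def gram_matrix(X):
    """Kernel (Gram) matrix that a kernel classifier would consume."""
    return [[intersection_kernel(x, z) for z in X] for x in X]

X = [[0.2, 0.5, 0.3],    # three L1-normalised histograms, illustrative
     [0.1, 0.1, 0.8],
     [0.4, 0.4, 0.2]]
K = gram_matrix(X)
```

Because each coordinate enters the kernel separately, dropping a feature simply removes its term from the sum, which is what makes per-feature selection natural in this setting.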

References

MAJI, S., BERG, A.C., and MALIK, J. (2013): Efficient classification for additive kernel SVMs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 66–77.
STEEL, S.J., LOUW, N. and BIERMAN, S. (2011): Variable selection for kernel classification. Communications in Statistics: Simulation and Computation, 40, 241–258.

Keywords

ADDITIVE KERNELS, FEATURE SELECTION, KERNEL CLASSIFICATION

82 Utilizing semantics for guiding multi-classifier systems

Ludwig Lausser, Florian Schmid, Johann Kraus, Axel Fürstberger, and Hans A. Kestler∗

Medical Systems Biology, Institute of Neural Information Processing, Ulm University, 89069 Ulm, Germany {ludwig.lausser, florian1.schmid, johann.kraus, axel.fuerstberger, hans.kestler}@uni-ulm.de ∗ corresponding author

Abstract. Marker selection is an essential step in developing interpretable prognostic or diagnostic methods. It affects the accuracy as well as the interpretability of a model. While purely data driven marker selection algorithms are mainly designed for improving the accuracy of a model, they often do not allow for constructing high-level hypotheses (e.g. an explanation in terms of pathways). External meta-information is needed to respect functional dependencies among different markers. In this work we extend our recently proposed approach of incorporating meta-information into the training of multi-classifier systems. This knowledge-based approach directly constructs decision rules from abstract functional or structural terms (e.g. GO terms or KEGG pathways). It is based on classifiers operating on interpretable signatures, which are known to be associated with some high-level terms. Here, we focus on the selection and construction of annotations that are possibly related to the topic of a classification task. We utilize semantic technologies for identifying possible candidate terms and for narrowing the search space for the construction of new explanatory hypotheses.

Keywords

FEATURE SELECTION, MULTI-CLASSIFIER SYSTEMS, SEMANTICS, BIOINFORMATICS

83 Characterizing feature selection algorithms

Lyn-Rouven Schirra, Ludwig Lausser, and Hans A. Kestler∗

Medical Systems Biology, Institute of Neural Information Processing, Ulm University, 89069 Ulm, Germany {lyn-rouven.schirra, ludwig.lausser, hans.kestler}@uni-ulm.de ∗ corresponding author

Abstract. The classification of high-dimensional gene expression profiles can lead to the identification of diagnostic features for distinguishing similar symptomatic phenotypes. In this setting, the development of diagnostic models is mainly coupled to the interest in low-dimensional and interpretable decision rules. Feature selection is an essential preprocessing step in dealing with such high-dimensional data. Besides a dimensionality reduction, these methods provide a list of selected markers that can be used for generating new biological hypotheses. We present a comparative study of purely data driven selection methods. These algorithms can be applied to a given dataset without interaction with a subsequent classifier. The feature selection methods are examined with regard to classifier-independent properties such as stability, pairwise similarity and granularity. The analyses are used to distinguish different subgroups of algorithms and to identify properties that are valuable for the generalization ability of different classification models.
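One common classifier-independent stability measure is the mean pairwise Jaccard index between the feature subsets selected on different resamples of the data. The measure and the toy subsets below are illustrative assumptions, not necessarily the authors' exact criteria.

```python
from itertools import combinations

# Sketch: selection stability as the mean pairwise Jaccard index between
# the feature subsets chosen on different resamples.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def selection_stability(subsets):
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Feature subsets selected on three hypothetical bootstrap resamples:
runs = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g3", "g4"}]
stability = selection_stability(runs)   # 1.0 would mean identical subsets
```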

Keywords

FEATURE SELECTION, DIMENSIONALITY REDUCTION, STABILITY ANALYSIS, BIOINFORMATICS

84 Network and Data Integration for Biomarker Signature Discovery via Network Smoothed T-Statistics

Yupeng Cun1,2 and Holger Fröhlich2

1 University of Cologne, Department of Translational Genomics, Weyertal 115b, 50931 Cologne, Germany [email protected] 2 University of Bonn, Bonn-Aachen International Center for IT, Dahlmannstr. 2, 53113 Bonn, Germany [email protected]

Abstract. Predictive, stable and interpretable gene signatures are generally seen as an important step towards a better personalized medicine. During the last decade various methods have been proposed for that purpose. However, one important obstacle to making gene signatures a standard tool in clinics is the typically low reproducibility of signatures combined with the difficulty of achieving a clear biological interpretation. For that purpose, in the last years there has been a growing interest in approaches that try to integrate information from molecular interaction networks. We here propose a technique that integrates network information as well as different kinds of experimental data (here exemplified by mRNA and miRNA expression) into one classifier. This is done by smoothing t-statistics of individual genes or miRNAs over the structure of a combined protein-protein interaction (PPI) and miRNA-target gene network via a random walk kernel. A permutation test is conducted to select features in a highly consistent manner, and subsequently a linear Support Vector Machine (SVM) classifier is trained. Compared to several other competing methods our algorithm reveals an overall better prediction performance for early versus late disease relapse and a higher signature stability. Moreover, obtained gene lists can be clearly associated with biological knowledge, such as known disease genes and KEGG pathways. We demonstrate that our data integration strategy can improve classification performance compared to using a single data source only. Our method, called stSVM, is available in the R-package netClass on CRAN (http://cran.r-project.org). This abstract is a short summary of an article that has been recently published in the open access journal PLoS ONE.
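The smoothing step can be sketched with iterated random-walk propagation with restart, t <- a * P t + (1 - a) * t0, a common form of network smoothing that approximates diffusion over the graph. The restart weight and the toy three-gene network are illustrative assumptions, not the authors' exact random walk kernel.

```python
# Sketch: smoothing per-gene t-statistics over a network by iterated
# random-walk propagation with restart.

def row_normalize(adj):
    return [[v / (sum(row) or 1.0) for v in row] for row in adj]

def smooth(t0, adj, a=0.5, iters=50):
    P = row_normalize(adj)
    t = list(t0)
    n = len(t0)
    for _ in range(iters):
        t = [a * sum(P[i][j] * t[j] for j in range(n)) + (1 - a) * t0[i]
             for i in range(n)]
    return t

# Genes 0 and 1 interact in the network; gene 2 is isolated.
adj = [[0, 1, 0],
       [1, 0, 0],
       [0, 0, 0]]
t_smooth = smooth([3.0, 0.0, 3.0], adj)
# Gene 1 borrows evidence from its strong neighbour; gene 0 is damped.
```

Features with consistently high smoothed statistics across permutations would then be passed to the SVM.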

Keywords

biomarker signature discovery, personalized medicine, SVM, data integration, biological networks

85 Minimizing Redundancy among Genes Selected Based on the Overlapping Analysis

Osama Mahmoud1, Andrew Harrison1, Asma Gul1, Zardad Khan1, Metodi V. Metodiev2, and Berthold Lausen1

1 Department of Mathematical Sciences, University of Essex, UK. 2 School of Biological Sciences/Proteomics Unit, University of Essex, UK.

Abstract. For many functional genomic experiments, identifying the most characterizing genes is a main challenge. Both the prediction accuracy and the interpretability of a classifier can be enhanced by performing the classification based only on a set of discriminative genes. Analyzing the overlap between the gene expression of different classes is an effective criterion for identifying relevant genes (Apiletti et al., 2012). However, genes selected by maximizing a relevance score can be highly redundant. We propose a scheme for minimizing selection redundancy, in which the Proportional Overlapping Score (POS) technique (Mahmoud et al., 2014) is extended by using a recursive approach to assign a set of complementary discriminative genes. The proposed scheme exploits the gene masks defined by POS to identify more integrated genes in terms of their classification patterns. The approach is validated by comparing its classification performance with other feature selection methods, Wilcoxon Rank Sum, mRMR, MaskedPainter and POS, for several benchmark gene expression data sets using three different classifiers: Random Forest; k Nearest Neighbour; Support Vector Machine. The experimental results of classification error rates show that our proposal achieves a better performance.
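The idea of an overlap-based relevance score can be illustrated by the proportion of samples that fall inside the overlapping interval of the two class-wise expression ranges (lower is better). This is only a simplified stand-in for the idea behind POS, not its exact definition; the expression values are illustrative.

```python
# Sketch: score a gene by the proportion of samples lying in the overlap
# of the two class-wise expression ranges (lower = more discriminative).

def overlap_proportion(class_a, class_b):
    lo = max(min(class_a), min(class_b))   # overlap interval start
    hi = min(max(class_a), max(class_b))   # overlap interval end
    if lo > hi:                            # disjoint ranges: no overlap
        return 0.0
    values = class_a + class_b
    return sum(1 for v in values if lo <= v <= hi) / len(values)

# Expression of two hypothetical genes in two classes:
good_gene = overlap_proportion([1.0, 1.2, 1.4], [2.9, 3.1, 3.3])  # disjoint
bad_gene = overlap_proportion([1.0, 2.0, 3.0], [1.5, 2.5, 3.5])   # heavy overlap
```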

References

APILETTI, D., BARALIS, E., BRUNO, G., FIORI, A. (2012): MaskedPainter: Feature Selection for Microarray Data Analysis. Intelligent Data Analysis, 16(4), 717–737.
DING, C., PENG, H. (2005): Minimum Redundancy Feature Selection from Microarray Gene Expression Data. Journal of Bioinformatics and Computational Biology, 3(02), 185–205.
MAHMOUD, O., HARRISON, A., PERPEROGLOU, A., GUL, A., KHAN, Z., METODIEV, M. and LAUSEN, B. (2014): A Gene Selection Method for Classification within Functional Genomics Experiments Based on the Proportional Overlapping Score. SUBMITTED FOR PUBLICATION.

Keywords

FEATURE SELECTION, REDUNDANT GENES, CLASSIFICATION, GENE MASKS.

16 CON-2B: Data Analysis in Social Sciences I

Thursday, July 3, 2014: 10:30 - 12:35, West Hall 3

The Effects of Parenthood on Well-Being

Colin Vance1 and Evgenia Samoilova2

1 RWI Essen/Jacobs University [email protected] 2 GESIS, Cologne/BIGSSS [email protected]

Abstract. According to recent reviews of the topic by Hansen (2012), Stanca (2012), and Blanchflower (2009), international cross-sectional and longitudinal studies converge in their conclusion that children have a mostly negative effect on life satisfaction. This paper investigates the relationship between parenthood and life satisfaction using longitudinal data on mothers and fathers from the German Socio-Economic Panel. To our knowledge, this is the first paper that takes into account the comprehensive life cycle of parenthood, as well as the first contribution on the topic of happiness in which quantile regression has been used to analyze panel data. Among other findings, we observe that the life cycle and settings of parenthood are highly relevant for the positive gains of children. Yet the effects of the life cycle differ across the happiness distribution. Among both mothers and fathers, the happiest and unhappiest individuals experience the least negative and least positive effects of parenthood in comparison to other quantiles.

References

BLANCHFLOWER, D. G. (2009): International Evidence on Well-being. In: A. B. Krueger (Ed.): Measuring the Subjective Well-Being of Nations: National Accounts of Time Use and Well-Being. University of Chicago Press, Chicago, 155–226.
HANSEN, T. (2012): Parenthood and Happiness: A Review of Folk Theories versus Empirical Evidence. Social Indicators Research, 108(1), 29–64.
STANCA, L. (2012): Suffer the Little Children: Measuring the Effects of Parenthood on Well-Being Worldwide. Journal of Economic Behavior and Organization, 81(3), 742–750.

Keywords

HAPPINESS, PARENTHOOD, QUANTILE REGRESSION, PANEL DATA

88 How health literacy facilitates healthy lifestyle habits: Analyses of data from an online study

Juliane Paech1 and Sonia Lippke2

1 Jacobs University Bremen, Jacobs Center on Lifelong Learning (JCLL), Campus Ring 1, 28759 Bremen, Germany, [email protected] 2 Jacobs University Bremen, Jacobs Center on Lifelong Learning (JCLL), Campus Ring 1, 28759 Bremen, Germany, [email protected]

Abstract. Physical activity is essential for healthy ageing, but many people do not know this or fail to translate their intention into action. Health literacy can be an important resource for overcoming barriers to physical activity by enabling social support and self-regulatory strategies. The present study examines the contribution that received social support, planning and self-regulation make in facilitating physical activity. An online study with three measurement points was conducted: intention was assessed at baseline, planning and social support at 4-week follow-up, and self-regulation and physical activity at 6-month follow-up. A path model was analyzed, modeling intention, received support, planning and self-regulation to predict physical activity. Received support, planning and self-regulation mediated the link from intention to physical activity; indirect effects were also significant. The proposed model was supported. Health literacy components such as received social support, planning and self-regulation facilitated physical activity. To support active and healthy ageing, health literacy should be enhanced, especially in vulnerable individuals.

References

LIPPKE, S. & ZIEGELMANN, J. P. (2008): Theory-based health behavior change: Developing, testing, and applying theories for evidence-based interventions. Applied Psychology: An International Review, 57, 698–716. doi: 10.1111/j.1464-0597.2008.00339.x
SCHWARZER, R. (2008): Modeling health behavior change: How to predict and modify the adoption and maintenance of health behaviors. Applied Psychology: An International Review, 57, 1–29. doi: 10.1111/j.1464-0597.2007.00325.x
ZIEGELMANN, J. P. & LIPPKE, S. (2007): Use of selection, optimization, and compensation strategies in health self-regulation: Interplay with resources and successful development. Journal of Aging and Health, 19, 500–518. doi: 10.1177/0898264307300197

Keywords

ABSTRACTS, GUIDELINES, LAYOUT, REFERENCES

89 Validation of questionnaires using a pilot trial and the English Longitudinal Study of Ageing

Florea, Adi1,2, Gage, Faith3, Head, Samantha3, Reynolds, Terri3, Marsland, Louise4, Jackson, Joanna4, and Lausen, Berthold1

1 Department of Mathematical Sciences, [email protected] 2 Department of Psychology, University of Essex, United Kingdom 3 Colchester Hospital University NHS Foundation Trust, United Kingdom 4 School of Health and Human Sciences, University of Essex, United Kingdom

Abstract. Urinary incontinence is a distressing condition affecting more than 5 million women in the UK. Treatment usually involves pelvic floor exercises, which appear to have no significant side effects and enable improvement in symptoms (Mantle & Versi, 1991). More recently Modified Pilates (MP) has been suggested as an additional means of reducing the severity of symptoms and improving the quality of life of sufferers (Head et al. 2013). In a pilot study 74 women were randomly assigned to two groups: Group 1 received pelvic floor exercises and lifestyle advice only; Group 2 attended a 6-week course of MP classes in addition to receiving pelvic floor exercises and lifestyle advice. Participants answered questionnaires on quality of life, self-esteem and symptom severity at baseline (T1), after the MP classes (T2) and 5 months after randomisation (T3). The paper analyses the observed baseline data (T1) in relation to measurements such as CASP-19 of the English Longitudinal Study of Ageing (ELSA, www.elsa-project.ac.uk) to validate the pilot trial questionnaires and to suggest possible instruments that might be used to measure primary and secondary endpoints for a planned full trial.

References

MANTLE, J., VERSI, E. (1991): Physiotherapy for stress urinary incontinence. A national survey. British Medical Journal, 302(6779), 753–755.
HEAD, S., GAGE, F., JACKSON, J., LAUSEN, B., MARSLAND, L. (2013): Modified Pilates as an Adjunct to Standard Physiotherapy Care for Urinary Incontinence: A Pilot Study. Funded by the National Institute for Health Research (NIHR) under its Research for Patient Benefit Programme. Grant Reference Number PB-PG-1010-23220. University Hospital Colchester, Colchester, UK.

Keywords

VALIDITY OF QUESTIONNAIRES, CASP-19

90 Student Life-Style Revisited - Values, Attitudes and Behavior

Andreas Geyer-Schulz, Thomas Hummel and Victoria-Anne Schweigert

Institute of Information Systems and Marketing (IISM), Karlsruhe Institute of Technology (KIT), Kaiserstraße 12, 76131 Karlsruhe {andreas.geyer-schulz, thomas.hummel, victoria-anne.schweigert}@kit.edu

Abstract. In this contribution we report on a pre-test of student life-style scales derived from Rokeach’s value survey [Rokeach1973], Mitchell’s values and life-style (VALS)[Mitchell1983] as well as Kahle’s list of values (LOV) [Kahle1983]. Technological change (especially the world wide web) and new trends, e.g. in food, recreation and drug consumption, lead to the requirement of changing the original scales in such a way that these trends are taken into account. We present and critically discuss the resulting modifications in this contribution. Methodologically, we concentrate on a comparison of the different operationalizations of the same latent constructs in the instruments presented above: Rokeach’s value survey is based on ranking, Mitchell’s VALS survey on rating data, and Kahle experimented with both types of operationalization.

References

KAHLE, Lynn R. (1983): Social Values and Social Change: Adaption to Life in America. Praeger, New York. MITCHELL, Arnold (1983): The Nine American Life Styles: Who We Are and Where We’re Going. Macmillan, New York. ROKEACH, Milton (1973): The Nature of Human Values. Free Press, New York.

Keywords

LIFE STYLE, VALUES, SCALES

91 MultiTrait-MultiMethod (MTMM) and CFA model in comparative analysis of 5, 7, 9 and 11 point scales

Piotr Tarka

Poznan University of Economics, Department of Marketing Research, Poland [email protected]

Abstract. In this article the author conducts a comparative analysis of the Likert rating scale based on different numbers of response categories (i.e., 5, 7, 9 and 11 points), attempting to find an optimum range of categories on the scale. For these scales, we used an empirical example based on the attitudes of young consumers (n = 200), which were studied in the area of the marketing-ethical behavior of companies. In the first part of the analysis we used descriptive statistics; then we applied MTMM in the context of a Confirmatory Factor Model (CFA). MTMM allowed for the comparison of matrices (under the assumption of various traits / items and different methods, i.e., the respective categories of the scales, which were applied for their measurement). When using MTMM we checked the overall convergence and divergence in the obtained results. As proof of convergence, we identified high correlations of the same traits measured by various methods. High convergence reflected stability of the measured traits regardless of the number of response categories on the scale. In the end, because visual inspection of the pattern correlation matrices with the MTMM method might be prone to errors, we used the CFA model, which provided an objective way of evaluating the matrices.

References

CAMPBELL, D.T., FISKE, D.W. (1959): Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix. Psychological Bulletin, 56, 81–105.
MARSH, H.W. (1989): Confirmatory Factor Analyses of Multitrait-Multimethod Data - Many Problems and a few Solutions. Applied Psychological Measurement, 13, 335–361.
TOMAS, J.M., HONTANGAS, P.M., OLIVER, A. (2000): Linear Confirmatory Factor Models to Evaluate Multitrait-Multimethod Matrices. Multivariate Behavioral Research, 35, 469–499.

Keywords

MTMM, CONSUMERS, CFA MODEL

17 CON-2C: Statistics and Data Analysis II

Thursday, July 3, 2014: 10:30 - 12:35, West Hall 4

The analysis of incomplete multi–way tables with the use of log–linear models

Justyna Brzezińska

Faculty of Management University of Economics in Katowice, 1 Maja 50, 40– 287 Katowice, Poland [email protected]

Abstract. A variety of social, medical, psychological and biological science data come in the form of cross-classified counts referred to as a contingency table. Such a table gives the observed counts simultaneously for the categories of two or more categorical variables. The data to be classified in a contingency table can be split into fully classified cases, where information on all the categories is available (complete tables), and partially classified cases, where the count for some of the categories is zero (zero-cell tables). Tables containing zeros include two types of zeros: sampling (random) zeros and structural (fixed) zeros. Several options for the analysis of tables with zero cells will be presented. A log–linear analysis will be conducted and the results for different adjustments for zero cells will be presented. All calculations will be conducted in R.
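One standard adjustment for sampling zeros, adding a small constant such as 0.5 to every cell before fitting, can be sketched for the independence log-linear model, whose expected counts have the closed form m_ij = (row total x column total) / grand total. The 2x3 table and the constant are illustrative; the abstract's own analyses are carried out in R.

```python
# Sketch: adjust sampling zeros by adding 0.5 to every cell, then compute
# the expected counts of the independence log-linear model.

def adjust_sampling_zeros(table, c=0.5):
    return [[n + c for n in row] for row in table]

def independence_expected(table):
    """Expected counts m_ij = (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

observed = [[10, 0, 5],     # a 2x3 table with one sampling zero
            [8, 4, 3]]
adjusted = adjust_sampling_zeros(observed)
expected = independence_expected(adjusted)
```

Structural zeros, by contrast, would be excluded from the fit rather than adjusted.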

References

CHRISTENSEN, R. (1997): Log-linear Models and Logistic Regression. Springer–Verlag, New York.
KNOKE, D., BURKE, P. (1997): Log-linear Models. Sage University Paper Series on Quantitative Applications in the Social Sciences, series no. 07-020. Sage, Beverly Hills and London.

Keywords

INCOMPLETE TABLES, ZERO CELLS CONTINGENCY TABLES, LOG–LINEAR MODELS.

94 The Weight of Penalty Optimization for Ridge Regression

Sri Utami Zuliana and Aris Perperoglou

Department of Mathematical Sciences, University of Essex, Colchester, CO4 3SQ, UK [email protected], [email protected]

Abstract. Ridge regression is a method of biased estimation whose purpose is to reduce variance and deal with multicollinearity. In this method, a penalty is added to the likelihood of the regression model. Depending on the weight of that penalty, coefficient estimates are shrunk towards zero. There are several methods to choose a penalty weight, such as Akaike's criterion, the Bayesian criterion or generalized cross validation. All of these methods are based on a grid search for the optimal penalty weight, which makes them computationally expensive. In this work, an algorithm to estimate a penalty weight will be illustrated, starting from an arbitrary initial value and iterating within fewer steps. This algorithm arises from a Bayesian perspective and was introduced by Schall as a method to estimate the variance of random effects. The theory, an application to data, and simulation studies will be presented.
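The flavour of Schall's iteration can be shown in the simplest possible setting, ridge regression with a single predictor, where no matrix algebra is needed: the penalty weight lambda = sigma^2 / tau^2 is re-estimated from the current fit instead of being searched over a grid. The data and starting value below are illustrative assumptions.

```python
# Sketch: Schall-style iterative estimate of the ridge penalty weight for
# a single predictor (no grid search, no matrix algebra).

def schall_ridge(x, y, lam=1.0, iters=30):
    n = len(x)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    for _ in range(iters):
        beta = sxy / (sxx + lam)                 # ridge estimate at current lambda
        edf = sxx / (sxx + lam)                  # effective degrees of freedom
        rss = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
        sigma2 = rss / (n - edf)                 # error variance estimate
        tau2 = beta * beta / edf                 # "random effect" variance estimate
        lam = sigma2 / tau2                      # Schall update of the penalty
    return lam, beta

lam, beta = schall_ridge([-2, -1, 0, 1, 2], [-1.9, -1.2, 0.1, 0.8, 2.2])
```

Each pass costs one model fit, so convergence in a handful of iterations is far cheaper than evaluating a whole grid of penalty weights.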

References

HOERL, A. E. and KENNARD, R. W. (1970): Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1), 55-67. SCHALL, R. (1991): Estimation in Generalized Linear Models with Random Effects. Biometrika, 78(4), 719-727.

Keywords

PENALIZED REGRESSION, SCHALL’S ALGORITHM, BAYESIAN

95 Correlated component regression: Profiling student performances by means of background characteristics

Bernhard Gschrey and Ali Ünlü

Chair for Methods in Empirical Educational Research, TUM School of Education, and Centre for International Student Assessment (ZIB), TU München, Marsstr. 20-22, 80335 Munich, Germany {bernhard.gschrey,ali.uenlue}@tum.de

Abstract. Multicollinearity is one of the main problems when using regression analytic approaches to predict outcome variables. The application of traditional regression analytic approaches often provides unstable and unreliable estimates of the parameters when multicollinearity occurs. This is especially true in Large Scale Assessments such as PIRLS/TIMSS or PISA, where the number of variables is high. In this paper we apply a regression analytic method called correlated component regression (CCR), developed by Magidson (2013), for characterizing student performances in PIRLS/TIMSS 2011 (Martin & Mullis 2013; www.timssandpirls.bc.edu) through selected background characteristics, such as cultural and socio-economic characteristics. On the basis of various criteria, we compare the findings of CCR with the results of ordinary least squares (OLS) regression regarding the prediction of student performance values. An implemented cross-validation procedure and step-down algorithm are utilized to perform a special type of variable reduction to the most relevant predictor components. Thus, the results of our study will provide more reliable and better interpretable sets of background variables for characterizing large scale educational data in the domains of reading, mathematics and science.

References

MAGIDSON, J. (2013): Correlated component regression: Re-thinking regression in the presence of near collinearity. In: New Perspectives in Partial Least Squares and Related Methods. Springer, Heidelberg.
MARTIN, M.O. & MULLIS, I.V.S. (2013): Methods and Procedures in TIMSS and PIRLS 2011. TIMSS & PIRLS International Study Center, Chestnut Hill.

Keywords

CORRELATED COMPONENT REGRESSION, HIGH DIMENSIONAL DATA, LARGE SCALE EDUCATIONAL ASSESSMENT, MULTICOLLINEARITY

96 Data Envelopment Analysis for City Efficiency

Daniel Reißmann, Iris Lehmann, Jörg Hennersdorf, Clemens Deilmann, and Martin Behnisch

Leibniz Institute of Ecological Urban and Regional Development, Weberplatz 1, 01217 Dresden. [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract. Extensive research has been oriented towards a quantitative understanding of the growth of cities, economies of scale and the social and environmental implications. Within the current debate on the sustainable development of cities, the general concept of Resource Efficiency (enhancing the quality of life while minimizing resource consumption) revives the discussion. In order to understand and concretize these questions and to enrich the debate, an attempt has been made to apply a method commonly used in the field of economics to measure efficiency, namely Data Envelopment Analysis (DEA), to the study of cities. DEA is a non-parametric, deterministic method to measure the efficiency of economic production, in which the relative efficiency of Decision Making Units (DMUs) is calculated. An examination of efficiency in 116 cities throughout Germany was undertaken to test the usefulness of DEA for the efficiency analysis of cities. The investigation included the elaboration of separate, simple economic and ecological models in order to allow more precise identification of the relevance of individual parameters during the evaluation process. Hypotheses were established for both models. The inputs and outputs were selected to best illustrate the expected correlations. The results allowed a ranking of cities as well as an estimation of the ratios of economic and ecological efficiencies of the investigated cities. DEA appears to be a highly promising heuristic tool with which to draw the basic outlines of a resource-efficient city and shed light on phenomena and relations of factors.
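In the special case of a single input and a single output, the DEA (CCR) efficiency of each Decision Making Unit reduces to its output/input ratio relative to the best observed ratio, which gives a minimal sketch of the idea; the city-style indices below are illustrative assumptions, not the study's actual variables.

```python
# Sketch: DEA efficiency in the one-input/one-output special case, where
# the CCR score reduces to each DMU's output/input ratio over the best ratio.

def dea_single(inputs, outputs):
    ratios = [o / i for i, o in zip(inputs, outputs)]
    best = max(ratios)
    return [r / best for r in ratios]

# Three hypothetical cities: input = resource consumption index,
# output = quality-of-life index.
efficiency = dea_single([100, 80, 120], [50, 48, 54])
```

With multiple inputs and outputs, as in the study's economic and ecological models, the per-DMU weights are instead obtained by solving a linear program.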

References

BOGETOFT, P. (2011): Benchmarking with DEA, SFA, and R. Springer, New York.
WILSON, P. W. (2008): FEAR: A software package for frontier efficiency analysis with R. Socio-Economic Planning Sciences, 42(4), 247–254.
KUOSMANEN, T. and KORTELAINEN, M. (2005): Measuring eco-efficiency of production with data envelopment analysis. Journal of Industrial Ecology, 9(4), 59–72.
BIAN, Y. and YANG, F. (2010): Resource and environment efficiency analysis of provinces in China: A DEA approach based on Shannon's entropy. Energy Policy, 38, 1909–1917.
LEHMANN, I., HENNERSDORF, J. and DEILMANN, C. (2010): Effizienzbewertung von Städten auf der Grundlage von Data Envelopment Analysis. DISP, 49(1), 44–53.

Keywords

DEA, Ecological Efficiency, Economic Efficiency, Cities, Spatial Planning

Optimization of a Simulation for Inhomogeneous Mineral Subsoil Machining

Swetlana Herbrandt1, Claus Weihs1, Manuel Ferreira2 and Christian Rautert3

1 TU Dortmund, Department of Statistics, Dortmund, Germany [email protected], [email protected] 2 TU Dortmund, Institute of Materials Engineering, Dortmund, Germany [email protected] 3 TU Dortmund, Institute of Machining Technology, Dortmund, Germany [email protected]

Abstract. The new generation of concrete, which enables more stable constructions, requires more efficient tools. Since the preferred tool for machining concrete is a diamond-impregnated drill with substantial initial investment costs, the reduction of tool wear is of special interest. The stochastic character of the diamond size, orientation and position in sintered segments, as well as differences in the machined material, justifies the development of a statistically motivated simulation. In the previously presented simulation, workpiece and tool are subdivided by Delaunay tessellations into predefined fragments. The heterogeneous nature of the ingredients of concrete is modeled by Gaussian random fields. Before proceeding with the simulation of the whole drill core bit, we have to adjust the simulation parameters for the two main components of the drill, diamond and metal matrix, by minimizing the discrepancy between simulation results and the conducted experiments. Since our simulation is an expensive black-box function with stochastic outcome and constrained parameters, we use the advantages of model-based optimization methods.
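Model-based optimization in this setting replaces the expensive simulation by a cheap surrogate model that proposes promising parameter values. The sketch below is not the authors' implementation: it is a minimal one-dimensional illustration with a Gaussian-process-style surrogate and a lower-confidence-bound criterion, where the function discrepancy merely stands in for the simulation-versus-experiment mismatch.

```python
import numpy as np

def rbf(A, B, ls=0.3):
    """Squared-exponential kernel on 1-d inputs."""
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ls) ** 2)

def surrogate(x_obs, y_obs, x_new, jitter=1e-8):
    """Posterior mean and sd of a simple zero-mean GP surrogate."""
    K = rbf(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    k = rbf(x_new, x_obs)
    mu = k @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(k * np.linalg.solve(K, k.T).T, axis=1)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def discrepancy(x):
    """Stand-in for |simulation result - experiment| (a black box)."""
    return (x - 0.7) ** 2

rng = np.random.default_rng(1)
x_obs = rng.uniform(0.0, 1.0, 4)          # initial design
y_obs = discrepancy(x_obs)
cands = np.linspace(0.0, 1.0, 201)        # constrained parameter grid
for _ in range(20):
    mu, sd = surrogate(x_obs, y_obs, cands)
    x_next = cands[np.argmin(mu - 2.0 * sd)]   # lower confidence bound
    x_obs = np.r_[x_obs, x_next]
    y_obs = np.r_[y_obs, discrepancy(x_next)]
best = x_obs[np.argmin(y_obs)]
```

The lower-confidence-bound criterion trades off exploitation (low predicted discrepancy) against exploration (high surrogate uncertainty), so the expensive black box is only evaluated at promising parameter values.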

Keywords

statistical simulation, Delaunay tessellation, Gaussian random fields, model-based optimization, subsoil machining

18 CON-2D: Data Analysis in Social Sciences II

Thursday, July 3, 2014: 10:30 - 12:35, West Hall 6

Utilization of Panel Data Analysis to Predict the Risk of Poverty of EU Households

Mária Stachová1 and Lukáš Sobíšek2

1 Faculty of Economics, Matej Bel University, Tajovského 10, 975 90 Banská Bystrica, Slovakia [email protected] 2 Faculty of Informatics and Statistics, University of Economics, Prague, W. Churchill Sq. 4, 130 67 Prague 3, Czech Republic [email protected]

Abstract. One of the main approaches to follow the causality among income, social inclusion and living conditions is based on regression models estimated using various statistical methods. This approach takes into account quantitative and qualitative information about individuals or households collected in different time periods, e.g. years, which can be transformed into multi-dimensional data sets, called panel data. Panel data regression models can describe dynamics over time periods, so that we can relate patterns, e.g. the risk-of-poverty rate, to changes in other characteristics. Nowadays, panel data analysis has become very popular and has been found to be a useful tool, mainly through software development and the availability of powerful computers. In our contribution, we present and compare different approaches to panel data analysis, namely RE-EM trees and mixed effect models, to predict the risk-of-poverty rate. To estimate the parameters of the models, we also employ sampling and resampling methods.
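RE-EM trees and mixed-effects models both exploit the household-level grouping of panel data. As a much simpler, hedged illustration of why this grouping matters, the sketch below simulates a panel with household-specific intercepts and compares pooled OLS (which ignores the panel structure and is biased when the regressor is correlated with the intercepts) with the within (fixed-effects) estimator; all names and numbers are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_households, n_years, beta = 200, 5, -0.8

# household-specific intercepts u induce correlation within a household;
# the regressor x is itself correlated with u, which biases pooled OLS
u = rng.normal(0.0, 2.0, n_households)
x = rng.normal(0.0, 1.0, (n_households, n_years)) + 0.8 * u[:, None]
y = u[:, None] + beta * x + rng.normal(0.0, 0.5, (n_households, n_years))

# pooled OLS on the stacked observations ignores the panel structure
beta_pooled = np.polyfit(x.ravel(), y.ravel(), 1)[0]

# within estimator: demeaning by household removes the intercepts u
xw = x - x.mean(axis=1, keepdims=True)
yw = y - y.mean(axis=1, keepdims=True)
beta_within = (xw * yw).sum() / (xw ** 2).sum()
```

Here beta_within recovers the true slope of -0.8, while beta_pooled is pulled far away by the household heterogeneity.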

References

SELA, R. J. and SIMONOFF, J. S. (2011): RE-EM Trees: A Data Mining Approach for Longitudinal and Clustered Data. Machine Learning, 86, 169–207.
PINHEIRO, J. and BATES, D. (2009): Mixed-Effects Models in S and S-PLUS. Springer, New York.
BALTAGI, B.H. (2012): Econometric Analysis of Panel Data. Wiley, Chichester.

Keywords

PANEL DATA, MIXED MODELS, RISK OF POVERTY, RE-EM TREES

Applying the Fuzzy Set Theory to Identify the Non-monetary Factors of Poverty

Marta Dziechciarz-Duda1 and Klaudia Przybysz2

1 Wroclaw University of Economics [email protected] 2 Wroclaw University of Economics [email protected]

Abstract. There is a practical problem of identifying poverty and measuring its level. Many methods of measuring poverty and identifying the poor can be found in the literature and research. A relatively new approach (see: Panek 2011) is to take the multi-dimensionality into account by applying fuzzy set theory to the measurement of poverty. This allows one to define the degree of membership in the group of the poor or the non-poor. The main goal of this article is to implement the fuzzy set theory approach to the definition and evaluation of poverty. It is assumed possible to identify psychological factors, generally classified as unmeasurable. This can be done by identifying the differences, for example in the perception of the situation of households, which in varying degrees belong to the group of the poor. It is well known that many families who are entitled to apply for aid do not use it. Identification of the factors determining the behavior of households considered poor can be a main element in the creation of future instruments of social policy. The study was conducted on data from the Social Diagnosis 2013.
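In the fuzzy approach referenced above, a household is not classified as poor or non-poor but receives a degree of membership in the set of the poor. A minimal sketch follows; the trapezoidal membership function and the thresholds are illustrative assumptions, not the article's specification.

```python
def poverty_membership(income, z_low, z_high):
    """Degree of membership in the fuzzy set of the poor:
    1 below z_low (certainly poor), 0 above z_high (certainly non-poor),
    linearly decreasing in between."""
    if income <= z_low:
        return 1.0
    if income >= z_high:
        return 0.0
    return (z_high - income) / (z_high - z_low)

# illustrative monthly incomes and thresholds
incomes = [400, 700, 1000, 1500]
memberships = [poverty_membership(i, z_low=500, z_high=1200) for i in incomes]
# a simple aggregate fuzzy poverty index: mean membership in the population
fuzzy_index = sum(memberships) / len(memberships)
```

The multidimensional extension replaces the single income variable by several (monetary and non-monetary) deprivation indicators, each with its own membership function, aggregated into one membership degree per household.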

References

[1] ALKIRE, S. and FOSTER, J. (2007): Counting and Multidimensional Poverty Measurement. OPHI Working Paper Series; www.ophi.org.uk.
[2] BATTISTON, D., CRUCES, G., LOPEZ-CALVA, L., LUGO, M. and SANTOS, M. (2009): Income and beyond: multidimensional poverty in six Latin American countries. OPHI Working Paper 17; www.ophi.org.uk.
[3] LEMMI, A. and BETTI, G. (Eds.) (2006): Fuzzy Set Approach to Multidimensional Poverty Measurement. Springer Science+Business Media, LLC, New York.
[4] PANEK, T. (2011): Ubóstwo, wykluczenie społeczne i nierówności. Teoria i praktyka pomiaru. Oficyna Wydawnicza SGH, Warszawa.

Keywords

POVERTY, FUZZY SET THEORY, MULTIDIMENSIONAL POVERTY

Ordered logistic model as a tool to identify the determinants of poverty risk among Polish households

Andrzej Wołoszyn1, Izabela Kurzawa2 and Romana Głowicka-Wołoszyn2

1 University School of Physical Education in Poznan [email protected] 2 Poznan University of Life Sciences [email protected], [email protected]

Abstract. Poverty remains one of the major social and economic problems of contemporary societies. The improvement of the living conditions of the less privileged social groups is one of the main objectives of modern concepts of economic development, as well as a key point of the EU cohesion policy. The determination of the extent of poverty, its depth and causes is essential for the adoption and implementation of an effective social policy of any country. Hence, the study of poverty continues to be a focal point of research endeavors of many university investigators, government welfare departments and international economic organizations, such as the World Bank. Since joining the EU in 2004, Poland has witnessed a substantial and positive change in its households' living conditions. However, the risk-of-poverty indicators, relative or absolute, dropped only slightly, with the former being now close to the EU average. The paper aims to conduct a multidimensional analysis of poverty among Polish households in 2005 and 2010. It seeks to compare the extent of relative and absolute poverty and identify its determinants, as well as the strength and direction of their impact. For this analysis it employs multinomial logit models for ordered categories. The data for the study has been taken from the individual microdata of the 2005 and 2010 Household Survey (of 34767 and 34412 households, respectively) conducted by the Polish Central Statistical Office.
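The proportional-odds (ordered logit) model used above can be written as P(y <= k | x) = logistic(c_k - x'beta) with increasing cutpoints c_k. A hedged sketch of its maximum-likelihood estimation on simulated data (one covariate, three ordered categories; all values illustrative, not the paper's data):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def ordered_logit_nll(params, X, y):
    """Negative log-likelihood of the proportional-odds model.
    params = [beta..., first cutpoint, log-increments of further cutpoints],
    the log-increments enforce increasing cutpoints."""
    p = X.shape[1]
    beta = params[:p]
    cuts = np.cumsum(np.r_[params[p], np.exp(params[p + 1:])])
    eta = X @ beta
    cum = expit(cuts[:, None] - eta[None, :])          # P(y <= k | x)
    cum = np.vstack([np.zeros_like(eta), cum, np.ones_like(eta)])
    idx = np.arange(len(y))
    probs = cum[y + 1, idx] - cum[y, idx]              # P(y = k | x)
    return -np.sum(np.log(np.clip(probs, 1e-12, None)))

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 1))
latent = X[:, 0] * 1.0 + rng.logistic(size=2000)       # true beta = 1
y = (latent[:, None] > np.array([-1.0, 1.0])).sum(axis=1)  # true cuts -1, 1
fit = minimize(ordered_logit_nll, x0=np.zeros(3), args=(X, y))
beta_hat, cut1_hat = fit.x[0], fit.x[1]
```

With many covariates (household size, education, region, etc.), X simply gains columns and params gains beta entries; the likelihood is unchanged.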

References

Borooah, V. K. (2001): Logit and probit: Ordered and multinomial models. Sage University Paper series on Quantitative Applications in the Social Sciences.
Brant, R. (1990): Assessing proportionality in the proportional odds model for ordinal logistic regression. Biometrics, 46, 4.
Greene, W.H. and Hensher, D.A. (2008): Modeling Ordered Choices: A Primer and Recent Developments. SSRN.
Hilbe, J.M. (2009): Logistic Regression Models. Chapman & Hall/CRC Press, Boca Raton.
Liao, T.F. (1994): Interpreting probability models: Logit, probit, and other generalized linear models. Sage University Paper series on Quantitative Applications in the Social Sciences.
Long, J.S. and Freese, J. (2006): Regression models for categorical dependent variables using Stata (edition). Stata Press Publication, College Station, Texas.
Williams, R. (2006): Generalized Ordered Logit / Partial Proportional Odds Models for Ordinal Dependent Variables. The Stata Journal, 6(1), 58–82.
Wolfe, R. and Gould, W. (1998): An approximate likelihood-ratio test for ordinal response models. Stata Technical Bulletin, 7(42).

Keywords

ordered logistic model, determinants of poverty

Multivariate Logistic Mixtures

Xiao Liu and Ali Ünlü

Chair for Methods in Empirical Educational Research, TUM School of Education and Centre for International Student Assessment (ZIB), TU München, Arcisstr. 21, 80333 Munich, Germany {x.liu,ali.uenlue}@tum.de

Abstract. Ray and Lindsay (2005) studied the key features of multivariate normal mixtures, including the determination of the number of modes and general modality theorems. For the logistic distribution, on the other hand, such information seems to be lacking. The logistic distribution plays an important role in psychometrics, for instance, for modeling item response functions (Reckase (2009)). In this paper, we propose analogs of the multivariate normal mixture results for the multivariate logistic distribution (Malik and Abraham (1973)). Unlike the mixture of multivariate normal distributions, for the logistic case it seems infeasible to express the ridgeline function explicitly. However, applying the implicit function theorem, we can prove that a unique explicit formula is possible locally. Moreover, we focus on displaying the elevation of the logistic mixture density on the ridgeline and address a technique called the Π-plot, both of which carry important information about modality properties of the mixture. We conclude with remarks about the similarities and differences between the multivariate normal and logistic distributions with regard to their mixture properties and conclusions thereof.
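For the normal-mixture case that this work builds on, the Ray-Lindsay ridgeline is explicit: x*(a) = [(1-a) S1^-1 + a S2^-1]^-1 [(1-a) S1^-1 mu1 + a S2^-1 mu2], for a in [0,1], and all critical points of a two-component mixture density lie on it, so plotting the density elevation along the ridgeline reveals the number of modes. A sketch of this normal (not logistic) case; parameter values are illustrative:

```python
import numpy as np

def ridgeline(a, mu1, S1, mu2, S2):
    """Ray-Lindsay ridgeline of a two-component normal mixture."""
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)
    return np.linalg.solve((1 - a) * P1 + a * P2,
                           (1 - a) * P1 @ mu1 + a * P2 @ mu2)

def normal_pdf(x, m, S):
    d = x - m
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / np.sqrt(
        (2 * np.pi) ** len(m) * np.linalg.det(S))

def n_modes(mu1, S1, mu2, S2, w=0.5, grid=401):
    """Count modes from the elevation of the mixture density on the ridgeline."""
    h = []
    for a in np.linspace(0.0, 1.0, grid):
        x = ridgeline(a, mu1, S1, mu2, S2)
        h.append(w * normal_pdf(x, mu1, S1) + (1 - w) * normal_pdf(x, mu2, S2))
    h = np.array(h)
    hp = np.r_[-np.inf, h, -np.inf]        # allow maxima at the endpoints
    return int(np.sum((hp[1:-1] > hp[:-2]) & (hp[1:-1] >= hp[2:])))

S = np.eye(2)
modes_far = n_modes(np.zeros(2), S, np.array([4.0, 0.0]), S)   # well separated
modes_near = n_modes(np.zeros(2), S, np.array([1.0, 0.0]), S)  # overlapping
```

The abstract's point is that in the logistic case this closed form is unavailable, and the ridgeline can only be obtained locally via the implicit function theorem.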

References

MALIK, H. J. and ABRAHAM, B. (1973): Multivariate Logistic Distributions. The Annals of Statistics, 1, 588–590.
RAY, S. and LINDSAY, B. G. (2005): The Topography of Multivariate Normal Mixtures. The Annals of Statistics, 33, 2042–2065.
RECKASE, M. D. (2009): Multidimensional Item Response Theory. Springer, New York.

Keywords

LOGISTIC MIXTURE, MULTIVARIATE MODE, RIDGELINE, Π-PLOT, ITEM RESPONSE THEORY

Fast DD-classification of functional data

Karl Mosler1 and Pavlo Mozharovskyi2

1 University of Cologne, Albertus Magnus Platz, 50923 Cologne, Germany [email protected] 2 University of Cologne, Albertus Magnus Platz, 50923 Cologne, Germany [email protected]

Abstract. A fast nonparametric procedure for classifying functional data is introduced. It consists of a two-step transformation of the original data plus a classifier operating on a low-dimensional hypercube. The functional data are first mapped into a finite-dimensional location-slope space and then transformed by a multivariate depth function into the DD-plot (Li et al., 2012), which is a subset of the unit hypercube. This transformation also yields a new notion of depth for functional data. Three alternative depth functions are employed for this, as well as two rules for the final classification on [0,1]q (Lange et al., 2014). The entire methodology does not involve smoothing techniques and is completely nonparametric. It is robust, efficiently computable, and has been implemented in an R environment. The new procedure is compared with known ones, including the componentwise approach of Delaigle et al. (2012), and its applicability is demonstrated by simulations as well as a benchmark study.
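Once the functional data have been mapped to a finite-dimensional space, the DD-plot idea can be illustrated with the simplest ingredients: Mahalanobis depth and the diagonal of the DD-plot as the separating rule, i.e. assign each point to the class in which it is deeper. This is a hedged toy sketch, not the authors' location-slope depth or their alpha-procedure:

```python
import numpy as np

def mahalanobis_depth(points, sample):
    """D(x) = 1 / (1 + (x - mu)' S^-1 (x - mu)) with respect to the sample."""
    mu = sample.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(sample.T))
    d = points - mu
    md2 = np.einsum('ij,jk,ik->i', d, S_inv, d)
    return 1.0 / (1.0 + md2)

def dd_classify(x_new, class0, class1):
    """Map points into the DD-plot (depth wrt class 0, depth wrt class 1)
    and assign by the larger depth, i.e. split along the diagonal."""
    d0 = mahalanobis_depth(x_new, class0)
    d1 = mahalanobis_depth(x_new, class1)
    return (d1 > d0).astype(int)

rng = np.random.default_rng(0)
c0 = rng.normal([0, 0], 1, (200, 2))
c1 = rng.normal([3, 3], 1, (200, 2))
test = np.vstack([rng.normal([0, 0], 1, (50, 2)),
                  rng.normal([3, 3], 1, (50, 2))])
labels = np.r_[np.zeros(50, int), np.ones(50, int)]
acc = (dd_classify(test, c0, c1) == labels).mean()
```

The procedure in the abstract replaces the diagonal by learned rules on the DD-plot and uses robust depth notions, but the geometry is the same.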

References

DELAIGLE, A., HALL, P. and BATHIA, N. (2012): Componentwise classification and clustering of functional data. Biometrika, 99, 299–313.
LANGE, T., MOSLER, K. and MOZHAROVSKYI, P. (2014): Fast nonparametric classification based on data depth. Statistical Papers, 55, 49–69.
LI, J., CUESTA-ALBERTOS, J.A. and LIU, R.Y. (2012): DD-classifier: Nonparametric classification procedure based on DD-plot. Journal of the American Statistical Association, 107, 737–753.

Keywords

FUNCTIONAL DEPTH, SUPERVISED LEARNING, CENTRAL REGIONS, LOCATION-SLOPE DEPTH, DD-PLOT, ALPHA-PROCEDURE, BERKELEY GROWTH DATA, MEDFLIES DATA.

19 CON-3A: Machine Learning and Knowledge Discovery IV

Thursday, July 3, 2014: 14:35 - 16:15, West Hall 2

Monitoring dynamic weighted majority method with Adaptive control chart based on real datasets with concept drift

Dhouha Mejri1, Mohamed Limam2 and Claus Weihs3

1 Technische Universität Dortmund, ISG Tunis, University of Tunis mejri [email protected] 2 ISG Tunis, University of Tunis [email protected] 3 Technische Universität Dortmund [email protected]

Abstract. Monitoring changes during a learning process is an interesting area of research in several online applications. The most important problem is how to detect and explain these changes so that the performance of the learning model can be controlled and maintained. Ensemble methods have coped well with concept drift. This paper presents an online classification ensemble method designed for concept drift, the Dynamic Weighted Majority algorithm (DWM). It adds and removes experts based on their performance and adjusts learners' weights taking into account their age in the ensemble as well as their history of correct predictions. The idea behind this paper is to monitor the classification error rates of DWM based on a time-adjusting control chart which adjusts the control limits each time an adjustment condition is satisfied. Moreover, this paper works with different real datasets for concept drift and analyses the impact of the diversity of the base classifiers and how they deal with non-stationary environments. Experiments have shown that monitoring the classification errors improves the probability of drift detection.
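The DWM algorithm of Kolter and Maloof (2007) maintains a weighted pool of experts: a wrong expert's weight is discounted by a factor beta, experts whose normalized weight falls below a threshold are removed, and a fresh expert is added periodically. A minimal sketch with deliberately trivial "experts" (running-majority predictors that ignore the input) on a stream with one abrupt drift; everything here is illustrative, not the paper's experimental setup:

```python
class MajorityExpert:
    """Trivial expert: predicts the majority label it has seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else 0
    def train(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

class DWM:
    def __init__(self, make_expert, beta=0.5, theta=0.01, period=50):
        self.make_expert = make_expert
        self.beta, self.theta, self.period = beta, theta, period
        self.experts, self.weights, self.t = [make_expert()], [1.0], 0

    def predict(self, x):
        votes = {}
        for e, w in zip(self.experts, self.weights):
            p = e.predict(x)
            votes[p] = votes.get(p, 0.0) + w
        return max(votes, key=votes.get)

    def update(self, x, y):
        self.t += 1
        for i, e in enumerate(self.experts):
            if e.predict(x) != y:
                self.weights[i] *= self.beta          # discount wrong experts
            e.train(x, y)
        if self.t % self.period == 0:
            m = max(self.weights)
            self.weights = [w / m for w in self.weights]   # normalise
            keep = [i for i, w in enumerate(self.weights) if w >= self.theta]
            self.experts = [self.experts[i] for i in keep]
            self.weights = [self.weights[i] for i in keep]
            self.experts.append(self.make_expert())        # add fresh expert
            self.weights.append(1.0)

model = DWM(MajorityExpert)
hits_after_drift = 0
for t in range(600):
    y = 0 if t < 300 else 1            # abrupt concept drift at t = 300
    pred = model.predict(None)
    model.update(None, y)
    if t >= 500:
        hits_after_drift += (pred == y)
acc_after_drift = hits_after_drift / 100
```

The monitored quantity in the paper is the resulting stream of classification errors, to which the adaptive control chart is applied.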

References

MEJRI, D., KHANCHEL, R. and LIMAM, M. (2013): An ensemble method for concept drift in nonstationary environment. Journal of Statistical Computation and Simulation, 83(6), 1115–1128.
KOLTER, Z.J. and MALOOF, M. A. (2007): Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research, 8, 2755–2790.

Keywords

ONLINE CLASSIFICATION, ENSEMBLE METHODS, ADAPTIVE CONTROL CHARTS, CONCEPT DRIFT.

Ensemble of k-Nearest Neighbour Classifiers for Class Membership Probability Estimation

Asma Gul1, Zardad Khan1, Osama Mahmoud1, Miftahuddin1, Werner Adler2, Aris Perperoglou1 and Berthold Lausen1

1 Department of Mathematical Sciences, University of Essex, Colchester, CO4 3SQ, UK. 2 Department of Biometry and Epidemiology, University of Erlangen-Nuremberg, Germany.

Abstract. Combining multiple classifiers can give substantial improvement in the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets (Breiman, 1996). This technique can also be used for estimating class membership probabilities (Kruppa et al., 2012). We propose an ensemble of k nearest neighbours (kNN) classifiers for class membership probability estimation in the presence of non-informative features in the data. This is done in two steps. Firstly, we select classifiers based upon their individual performance from a set of base kNN models, each generated on a bootstrap sample using a random feature set from the feature space of the training data. Secondly, a stepwise selection is used on the selected learners, and those models are added to the ensemble that maximize its predictive performance. We use benchmark data sets with some added non-informative features for the evaluation of our method. Experimental comparison of the proposed method with usual kNN, bagged kNN, random kNN and random forest shows that it leads to high predictive performance in terms of minimum Brier score on most of the data sets (Hothorn et al., 2004). The results are also verified by simulation studies.
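The building blocks of the first step, bootstrap sampling plus a random feature subset per base kNN model, with class membership probabilities obtained by averaging the neighbour proportions, can be sketched as follows. This omits the paper's model-selection steps; all sizes and names are illustrative.

```python
import numpy as np

def knn_proba(X_tr, y_tr, X_te, k=5):
    """P(class 1 | x) estimated as the share of 1s among the k nearest
    neighbours (squared Euclidean distance)."""
    d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d, axis=1)[:, :k]
    return y_tr[nn].mean(axis=1)

def ensemble_knn_proba(X_tr, y_tr, X_te, n_models=25, n_feat=3, k=5, seed=0):
    """Average kNN probabilities over bootstrap samples, each restricted
    to a random feature subset."""
    rng = np.random.default_rng(seed)
    n, p = X_tr.shape
    probs = np.zeros(len(X_te))
    for _ in range(n_models):
        rows = rng.integers(0, n, n)                  # bootstrap sample
        cols = rng.choice(p, n_feat, replace=False)   # random feature subset
        probs += knn_proba(X_tr[rows][:, cols], y_tr[rows], X_te[:, cols], k)
    return probs / n_models

# toy data: 2 informative features, 8 non-informative noise features
rng = np.random.default_rng(1)
y_tr = rng.integers(0, 2, 300)
X_tr = np.c_[2.0 * y_tr[:, None] + rng.normal(size=(300, 2)),
             rng.normal(size=(300, 8))]
y_te = rng.integers(0, 2, 100)
X_te = np.c_[2.0 * y_te[:, None] + rng.normal(size=(100, 2)),
             rng.normal(size=(100, 8))]

p_hat = ensemble_knn_proba(X_tr, y_tr, X_te)
brier = np.mean((p_hat - y_te) ** 2)
```

The Brier score used in the paper, the mean squared difference between predicted probability and observed outcome, is computed in the last line.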

References

KRUPPA, J., ZIEGLER, A. and KÖNIG, I. R. (2012): Risk Estimation and Risk Prediction Using Machine Learning Methods. Human Genetics, 131(10), 1639–1654.
BREIMAN, L. (1996): Bagging predictors. Machine Learning, 24(2), 123–140.
HOTHORN, T., LAUSEN, B., BENNER, A. and RADESPIEL-TRÖGER, M. (2004): Bagging Survival Trees. Statistics in Medicine, 23(1), 77–91.

Keywords

ENSEMBLE METHODS, k-NEAREST NEIGHBOURS, NON-INFORMATIVE FEATURES

Multivariate functional regression analysis with application to classification problems

Tomasz Górecki1 and Waldemar Wołyński2

1 Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Umultowska 87, 61-614 Poznań [email protected] 2 Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Umultowska 87, 61-614 Poznań [email protected]

Abstract. Data in the form of a continuous vector function on a given interval are referred to as multivariate functional data. These data are treated as realizations of multivariate random processes. Our work presents a new method of classification based on multivariate regression analysis (linear, logistic and nonparametric regression) for this type of data. In the case of more than two groups the methods of combining classifiers are also discussed. This classification method for multivariate functional data is presented, illustrated and discussed in the context of analyzing real time series.

References

GÓRECKI, T., KRZYŚKO, M. and WASZAK, Ł. (2014): Functional Discriminant Coordinates. Communications in Statistics - Theory and Methods, 43, 1013–1025.
HORVÁTH, L. and KOKOSZKA, P. (2012): Inference for Functional Data with Applications. Springer, New York.
MÜLLER, H.G. and STADTMÜLLER, U. (2005): Generalized functional linear models. The Annals of Statistics, 33, 774–805.
RAMSAY, J.O. and SILVERMAN, B.W. (2005): Functional Data Analysis, Second Edition. Springer, New York.

Keywords

MULTIVARIATE FUNCTIONAL DATA, FUNCTIONAL DATA ANALYSIS, REGRESSION ANALYSIS, CLASSIFICATION

Assessing the reliability of a multi-class classifier

Luca Frigau1, Claudio Conversano2 and Francesco Mola2

1 Dipartimento di Scienze della Vita e dell’Ambiente [email protected] 2 Dipartimento di Scienze Economiche ed Aziendali {conversa,mola}@unica.it

Abstract. Multi-class learning requires a classifier to discriminate among a large set of K classes in order to define a classification rule able to identify the correct class for new observations. The resulting classification rule may not always be robust, particularly when imbalanced classes are observed or the data size is not large. In this paper a new approach is presented, aimed at evaluating the reliability of a classification rule. It uses a standard classifier but evaluates the reliability of the obtained classification rule by re-training the classifier on resampled versions of the original data. User-defined misclassification costs are assigned to the obtained confusion matrices and then used as inputs in a Beta regression model which provides a cost-sensitive weighted classification index. The latter is used jointly with another index measuring dissimilarity in distribution between observed classes and predicted ones. Both indexes are defined in [0,1] so that their values can be graphically represented in a [0,1]2 space. The examination of the points in the [0,1]2 space for each classifier allows us to evaluate its reliability on the basis of the relationship between the values of both indexes obtained on the original data and on resampled versions of it.

References

CRIBARI-NETO, F. and ZEILEIS, A. (2010): Beta regression in R. Journal of Statistical Software, 34(2), 1–24.
FERRARI, S. and CRIBARI-NETO, F. (2004): Beta regression for modeling rates and proportions. Journal of Applied Statistics, 31(7), 799–815.
VAN SON, R. (1995): A method to quantify the error distribution in confusion matrices. Proceedings of the Fourth European Conference on Speech Communication and Technology, EUROSPEECH 1995, Madrid, pp. 2277–2280.

Keywords

MULTI-CLASS LEARNING, CLASSIFIER PERFORMANCE, BETA REGRESSION, VISUALIZATION

20 CON-3B: Clustering I

Thursday, July 3, 2014: 14:35 - 16:15, West Hall 3

A Biclustering Model and Method for a Sparse Binary Data

Tadashi Imaizumi1

School of Management & Information Sciences, Tama University, 4-1-1 Hijirigaoka, Tama city, Tokyo, JAPAN [email protected]

Abstract. Segmenting customers has been an important data processing task in marketing analysis. However, simultaneous clustering of customers and products has become even more important, as each customer has many paths to get information on these products. Biclustering methods are attractive for such data matrices or data arrays. When applying these methods to analyze a binary matrix or array, we have to distinguish two cases. One is the case in which the data are binary due to shortcomings in the data collection; then we can represent customers and products quantitatively. The other case is that the binary data are inherent to the relationship between a customer and a product, for example in a recall experiment for favorite drinks. In this case the data matrix becomes sparser, and the number of clusters will increase as the number of products increases. A biclustering model with overlapping clusters for a sparse binary matrix is proposed, together with an estimation method. An application to a real data set is shown.

References

KAISER, S. (2011): Biclustering: Methods, Software and Application. Verlag Dr. Hut.
VAN MECHELEN, I., BOCK, H.-H. and DE BOECK, P. (2005): Two-mode clustering methods. In: Everitt, B. and Howell, D. (Eds.): Encyclopedia of Behavioral Statistics. Wiley, Chichester, 2081–2086.
VICHI, M. (2001): Double k-means clustering for simultaneous classification of objects and variables. In: Borra, S., Rocci, R. and Schader, M. (Eds.): Advances in Classification and Data Analysis. Springer, Heidelberg, pp. 43–52.

Keywords

OVERLAPPING CLUSTERS, SUBSET CLUSTERING, TEXT MIN- ING

Geographic clustering through aggregation control

Daher Ayale

Lab-STICC CNRS UMR 6285 / Université de Bretagne Sud [email protected]

Abstract. The actors in spatial decision-making are often led to define strategic zonings satisfying both constraints of structural homogeneity and spatial cohesion. In most cases, they use traditional clustering algorithms that do not take geographic information into account, and possibly correct the excessive fragmentation with ad hoc heuristics.

Usually, in clustering algorithms, the number of clusters C is initially fixed. In this paper, it is considered that, equivalently, the number of geographically connected components (called regions) R is also initially fixed, with, of course, the constraint R ≥ C. Rather than seeking an absolute solution to this optimization problem, which is computationally difficult, the proposed method uses an algorithm to control the intra-class geographic aggregation in order to approach (possibly achieve) the constraint set on the number of regions.

We present in detail the proposed method, and then show how the level of aggregation can be controlled by a parameter of the algorithm. Finally we seek the parameter value that leads to the most appropriate solution, and we study the dynamic evolution of clusters and regions according to the parameter control.

The results are illustrated on a real example.

References

OLIVER, M.A. and WEBSTER, R. (1989): A Geostatistical Basis for Spatial Weighting in Multivariate Classification. Mathematical Geology, 21, 275–289.

Keywords

Geographic clustering, Connected components, Geographic Informa- tion.

Three-way clustering problems in regional science∗

Andrzej Sokołowski1, Małgorzata Markowska2, and Danuta Strahl3

1 Cracow University of Economics [email protected] 2 Wroclaw University of Economics [email protected] 3 Wroclaw University of Economics [email protected]

Abstract. Three-way clustering problems have been considered for many years. They are popular especially in psychology (Kiers (1991)) and chemistry (Smilde (1992)), but some of the propositions and methods are of a more general nature (Basford and McLachlan (1985), Vermunt (2007)). In regional science, three-way data matrices consist of objects (regions), variables and time units (years). Direct simultaneous clustering seems to be senseless, since different variable values in different objects in different years would have to be treated as individual entities and grouped into clusters, which are then impossible to interpret. On the other hand, asking which variables, in which regions and when, follow a homogeneous pattern is meaningful. In the paper we consider the possibility of answering this question through the decomposition of the complex clustering task into two or three sub-problems, starting from one of the three modes – variables, regions or time units. An example on Eurostat data is also presented.

References

BASFORD, K.E. and McLACHLAN, G.J. (1985): The mixture method for clustering applied to three-way data. Journal of Classification, 2, 109–125.
KIERS, H.A. (1991): Hierarchical relations among three-way methods. Psychometrika, 56(3), 449–470.
SMILDE, A.K. (1992): Three-way analyses problems and prospects. Chemometrics and Intelligent Laboratory Systems, 15(2), 143–157.
VERMUNT, J.K. (2007): A hierarchical mixture model for clustering three-way data sets. Computational Statistics & Data Analysis, 51, 5368–5376.

Keywords

DYNAMIC CLUSTERING, THREE-WAY CLUSTERING

∗ The paper was prepared within the project financed by the Polish National Centre for Sci- ence, decision DEC-2013/09/B/HS4/0509.

Evaluating the Necessity of a Triadic Distance Model

Atsuho Nakayama

Tokyo Metropolitan University, 1-1 Minami-Ohsawa, Hachioji-shi, 192-0397, Japan, [email protected]

Abstract. A number of studies have examined multi-way proximity generalizations of multidimensional scaling (MDS) models. Some of these have proposed one-mode three-way proximity data analysis to investigate triadic relationships among three objects (e.g. De Rooij (2002), De Rooij and Gower (2003)). However, Gower and De Rooij (2003) concluded that the results of a one-mode three-way MDS are similar to those of a one-mode two-way MDS. Moreover, there is no technique for judging whether one-mode three-way MDS analysis or one-mode two-way MDS analysis is more appropriate. It will be useful to consider the reasons for these similarities and to establish a technique for examining the necessity of one-mode three-way MDS analysis. Here we propose a technique that evaluates the necessity of a triadic distance model, using a log-linear model. When the analysis of the log-linear model shows that the two objects i and j are independent of the objects k, the one-mode three-way proximity data can be reduced to one-mode two-way proximity data by merging over the objects k. On the other hand, one-mode three-way proximity data should not be reduced to one-mode two-way proximity data when the analysis of the log-linear model shows that the two objects i and j are not independent of the objects k. The present study then discusses the similarities and differences between triadic and dyadic relationships.

References

DE ROOIJ, M. (2002): Distance Models for Three-way Tables and Three-way Association. Journal of Classification, 19, 161–178.
DE ROOIJ, M. and GOWER, J. C. (2003): The Geometry of Triadic Distances. Journal of Classification, 20, 181–220.
GOWER, J. C. and DE ROOIJ, M. (2003): A Comparison of the Multidimensional Scaling of Triadic and Dyadic Distances. Journal of Classification, 20, 115–136.

Keywords

DYADIC DISTANCE, LOG-LINEAR MODEL, MDS, TRIADIC DISTANCE

21 CON-3C: Statistics and Data Analysis III

Thursday, July 3, 2014: 14:35 - 16:15, West Hall 4

Employing cluster analysis for the study of income tax law of Greece

Leonidas Tokou1, Iannis Papadimitriou1 and Athanasios Vazakidis1

1University of Macedonia, Department of Applied Informatics, 156 Egnatia Street, GR-54006 Thessaloniki, Greece. [email protected], [email protected], [email protected]

Abstract. The study of a country’s tax policy structure reveals economic, political and social aspects of the examined economy, thus representing a considerable research challenge. Under this consideration, we focus on the relationship between income tax policy, imposed upon individuals and legal entities, and government behavior in Greece. To this end, we use hierarchical cluster analysis in order to identify patterns in the context of the Greek tax law. By ascribing the contextual changes in the law text to variables and defining clusters of year periods, we try to explore patterns regarding the tax policy in order to provide a meaningful approach to the diachronic study of the income tax law. Finally, preliminary results, based on eight tax benefits, confirm the theoretical notion about politicians’ behavior and their perception that income tax does influence electoral votes. Those results, which are based on the whole and vast income tax law text, seem to offer a new perspective and a first guidance with regard to the understanding and analysis of the income tax law.

References

HIRSCHBERG, J.G., MAASOUMI, E. and SLOTTJE, D.J. (1991): Cluster analysis for measuring welfare and quality of life across countries. Journal of Econometrics, 50, 131–150.
EVERITT, B. (1993): Cluster analysis. Edward Arnold, a division of Hodder & Stoughton, London.
LEBART, L. (1994): Complementary use of correspondence analysis and cluster analysis. In: M. Greenacre and J. Blasius (Eds.): Correspondence Analysis in the Social Sciences. Academic Press, London, 162–178.

Keywords

CLUSTER ANALYSIS, INCOME TAX LAW, TAX POLICY

A comparison of heuristic and model-based clustering methods for dietary pattern analysis

Claudia Börnhorst1, Benjamin Greve1,2 and Iris Pigeot1,2

1 Leibniz-Institute for Prevention Research and Epidemiology - BIPS GmbH, Achterstr. 30, 28359 Bremen, Germany, [email protected] 2 University of Bremen, Faculty of Mathematics and Computer Science, 28359 Bremen, Germany

Abstract. Cluster analysis of food frequency questionnaire (FFQ) data is widely applied to identify dietary patterns. Unfortunately, the commonly used K-means algorithm and Ward's method are biased towards the identification of spherical clusters of equal volume. Recently, a more flexible method based on Gaussian mixture models has been suggested. Our study aimed to find the most appropriate method for clustering FFQ data. The three clustering methods were applied to simulated datasets with different cluster structures in order to compare their performance knowing the true cluster membership of the observations. Furthermore, the methods were applied to real FFQ data to explore their performance in practice. The Gaussian mixture model outperformed the other methods in the simulation study in up to 90% of the cases, depending on the simulated cluster structure, where especially Ward's method performed poorly. When applying the three methods to real data, all methods identified three similar dietary patterns: a "non-processed" cluster characterized by a high consumption of fruits, vegetables and whole-meal bread, a "balanced" cluster with only slight preferences for single foods, and a "junk food" cluster showing a high consumption of fast food, sweet snacks, dairy products and breakfast cereals. The simulation study suggests that clustering via Gaussian mixture models should be preferred due to its higher flexibility. K-means seems to be a good alternative, being easier to use while giving similar results when applied to real data.

References

Fahey MT, Thane CW, Bramwell GD and Coward WA (2007): Conditional Gaussian mixture modelling for dietary pattern analysis. J R Stat Soc Ser A Stat Soc, 170, 149–166.

Keywords

Gaussian mixture model, IDEFICS study, multidimensional data

Reification of subjective vehicle impressions - Objectification of the individual perceived quality from head-up-display images

Sonja Maria Köppl

Working field of image signal processing, TU Dortmund, 44227 Dortmund [email protected]

Abstract. Countless features are of importance for the assessment of a vehicle. Likewise, subjective evaluations are essential for quality perception and purchase decisions. Unfortunately, the objectification of personal impressions is very complex. In many cases, corresponding investigations are based on small samples, and the basic procedure is rarely presented systematically [1]. In this work, the entire procedure is described step by step, from data acquisition through subjective evaluation and the calculation of objective parameters up to the evaluation with statistical methods. The focus is on the handling of large sample sizes in customer surveys. It is shown that simple clustering methods are able to split the data into a small number of subgroups. The aim is that one cluster contains only units with the same subjective perception. Consequently, the rating of a single element equals the assessment of all other units in the same cluster, which makes the determination of the subjective evaluation considerably easier. Similarly, the effort of customer surveys is reduced because only one element of each cluster must be rated by the participants. The investigations carried out here are based on the perceived quality of the virtual head-up-display image. It is clarified how the perceived image quality can be measured and assessed.

References

BECKER, K. and HAAL, M. (2005): Objektivierung subjektiver Fahreindrücke: Methodik und Anwendung. Benchmarking der Leerlaufgeräuschqualität von Personenkraftwagen. In: DAGA 2005, 401–402.

Keywords

OBJECTIFICATION, UNSUPERVISED LEARNING, EXECUTION OF CUSTOMER SURVEYS

Reification of subjective vehicle impressions - Applying classification methods to predict the perceived quality from head-up-display images

Sonja Maria Köppl

Working field of image signal processing, TU Dortmund, 44227 Dortmund [email protected]

Abstract. Numerous studies about the perception of subjective impressions in a vehicle exist. These studies all have one thing in common: they use standard statistical methods to evaluate the relationship between the subjective impressions and the technical parameters. In this work it is investigated whether classification methods are more suitable for this task. Based on the virtual image of the head-up-display, an algorithm is developed which is able to predict the perceived image quality using a classifier approach. The basis is formed by previously conducted customer surveys in which representative virtual images were evaluated according to the subjective quality perception. From these study results, a prediction system is developed. To this end, it is first necessary to define objective features that describe the subjective impressions. Subsequently, the characteristic values of these criteria are determined for each image of the study sample. Taken together, the subjective ratings and the feature representations of the test samples form the training set for the classifier. During the training of the classification technique, the subjective sensation is recreated by the objective features. The resulting algorithm is then able to predict the subjective impression for any virtual image. The evaluation of empirical studies with supervised learning methods is largely new and does not correspond to the procedure described in standard textbooks.

Keywords

QUALITY PREDICTION, SUPERVISED LEARNING, EVALUATION OF CUSTOMER SURVEYS

22 CON-3D: Data Analysis in Marketing II

Thursday, July 3, 2014: 14:35 - 16:15, West Hall 5

Evaluating Advertising Campaigns Using Image Data Analysis and Classification

Daniel Baier, Sarah Frost, and Ines Daniel

Brandenburg University of Technology Cottbus-Senftenberg, Chair of Marketing and Innovation Management, Erich-Weinert-Straße 1, 03046 Cottbus, Germany {daniel.baier | sara.frost | ines.daniel}@tu-cottbus.de

Abstract. In many consumer markets the physicochemical differences between products decline. Consequently, to strengthen their products' unique positioning, producers rely on advertising campaigns that connect their product with favorable atmospheres. So, e.g., Krombacher uses blue lakes and green islands to connect their pilsner with naturalness whereas Jever uses seashores to connect their pilsner with calm and relaxation. However, the question often arises whether these positionings are really unique, and how the campaign should be continued. Often, confusion experiments are used to answer this question: Respondents are confronted with (masked) print ads and asked to name the advertised product. The allocation frequencies are used as measures for the positionings' uniqueness and stability. Recently, an alternative approach has been proposed (Baier et al. 2012, Frost 2014): The print ads are treated as images from which low- and high-level features are extracted (e.g., color and edge histograms, number of detected faces). Then, featurewise distances are calculated and aggregated to score the differences between the print ads and, consequently, the uniqueness and stability of the products' positionings. In this paper we analyze the positionings of 16 German beer brands using this measurement approach. 1,600 print ads are collected via search engines and used as training and testing samples for supervised learning with discriminant analysis and support vector machines. The results are compared with the results of a confusion experiment with 446 respondents.
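
The feature-extraction step described above can be sketched as follows; the histogram parameters and the L1 aggregation are illustrative assumptions, not the specific distance measures of Baier et al. (2012):

```python
import numpy as np

def color_histogram(img, bins=8):
    """Low-level feature: per-channel intensity histogram of an RGB
    image given as an (H, W, 3) uint8 array, normalized to sum to 1."""
    feats = []
    for c in range(3):
        h, _ = np.histogram(img[..., c], bins=bins, range=(0, 256))
        feats.append(h / h.sum())
    return np.concatenate(feats)

def feature_distance(f, g):
    """Featurewise L1 distance; per-feature distances like this one
    would then be aggregated into an overall ad-difference score."""
    return float(np.abs(f - g).sum())

# Two stand-in "print ads" (random pixels, purely for illustration)
rng = np.random.default_rng(42)
ad_a = rng.integers(0, 256, size=(60, 40, 3), dtype=np.uint8)
ad_b = rng.integers(0, 256, size=(60, 40, 3), dtype=np.uint8)
d_ab = feature_distance(color_histogram(ad_a), color_histogram(ad_b))
d_aa = feature_distance(color_histogram(ad_a), color_histogram(ad_a))
```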

References

BAIER, D., DANIEL, I., FROST, S. and NAUNDORF, R. (2012): Image Data Analysis and Classification in Marketing. Advances in Data Analysis and Classification, 6(4), 253–276. FROST, S. (2014): Distanzmaße in der Bildähnlichkeitsanalyse: Neue Verfahren und deren Anwendung im Marketing. Dr. Kovač, Hamburg.

Keywords

IMAGE DATA ANALYSIS AND CLASSIFICATION

Accommodating Heterogeneity and Nonlinearity in Price Effects for Predicting Brand Sales and Profits

Winfried J. Steiner1, Stefan Lang2, Anett Weber3, and Peter Wechselberger4

1 Department of Marketing, Clausthal University of Technology, 38678 Clausthal-Zellerfeld [email protected] 2 Department of Statistics, University of Innsbruck, A-6020 Innsbruck, Austria [email protected] 3 Department of Marketing, Clausthal University of Technology, 38678 Clausthal-Zellerfeld [email protected] 4 [email protected]

Abstract. We propose a hierarchical Bayesian semiparametric approach to account simultaneously for heterogeneity and functional flexibility in store sales models. To estimate own- and cross-price response flexibly, a Bayesian version of P-splines introduced by Lang and Brezger (2004) is used. Heterogeneity across stores is accommodated by embedding the semiparametric model into a hierarchical Bayesian framework that yields store-specific own- and cross-price response curves. More specifically, we propose multiplicative store-specific random effects that scale the nonlinear price curves while their overall shape is preserved. Estimation is fully Bayesian and based on novel MCMC techniques. In an empirical study, we demonstrate a higher predictive performance of our new flexible heterogeneous model over competing models that capture heterogeneity or functional flexibility only (or neither of them) for nearly all brands analyzed. In particular, allowing for heterogeneity in addition to functional flexibility can improve the predictive performance of a store sales model considerably, while incorporating heterogeneity alone only moderately improved or even decreased predictive validity. Taking into account model uncertainty, we show that the proposed model leads to higher expected profits as well as to materially different pricing recommendations.
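
As a small aside, the P-spline building block (B-spline basis plus difference penalty) can be sketched in its penalized least-squares form; the hierarchical Bayesian model with MCMC estimation is of course richer than this frequentist analogue, and the data below are simulated:

```python
import numpy as np

def bspline_basis(x, n_basis=12, degree=3):
    """B-spline design matrix on [0, 1) via the Cox-de Boor recursion."""
    n_knots = n_basis + degree + 1
    inner = np.linspace(0.0, 1.0, n_knots - 2 * degree)
    knots = np.concatenate([np.zeros(degree), inner, np.ones(degree)])
    # degree-0 basis: indicator of each half-open knot interval
    B = ((x[:, None] >= knots[None, :-1]) & (x[:, None] < knots[None, 1:])).astype(float)
    for d in range(1, degree + 1):
        m = B.shape[1] - 1
        new = np.zeros((len(x), m))
        for j in range(m):
            den1 = knots[j + d] - knots[j]
            den2 = knots[j + d + 1] - knots[j + 1]
            if den1 > 0:
                new[:, j] += (x - knots[j]) / den1 * B[:, j]
            if den2 > 0:
                new[:, j] += (knots[j + d + 1] - x) / den2 * B[:, j + 1]
        B = new
    return B

def pspline_fit(x, y, lam=1.0, n_basis=12):
    """Penalized least squares with a second-order difference penalty --
    the posterior-mode analogue of the Bayesian P-spline prior."""
    B = bspline_basis(x, n_basis)
    D = np.diff(np.eye(n_basis), n=2, axis=0)   # second-difference matrix
    beta = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)
    return B @ beta, beta

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(size=200))              # stand-in price variable
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=200)
fit, beta = pspline_fit(x, y)
```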

References

BREZGER, A. and STEINER, W.J. (2008): Monotonic Regression Based on Bayesian P-Splines: An Application to Estimating Price Response Functions From Store-Level Scanner Data. Journal of Business & Economic Statistics, 26, 90-104. LANG, S. and BREZGER, A. (2004): Bayesian P-Splines. Journal of Computational and Graphical Statistics, 13, 183-212. LANG, S., UMLAUF, N., WECHSELBERGER, P., HARTGEN, K. and KNEIB, T. (2013): Multilevel structured additive regression. Statistics and Computing, to appear (DOI 10.1007/s11222-012-9366-0).

Keywords

SALES RESPONSE MODELING, HETEROGENEITY, FUNCTIONAL FLEXIBILITY, SALES PREDICTION, EXPECTED PROFITS

Adaptive Discrete Choice Models for Brand Price Trade-Off

Peter Kurz1

TNS Infratest GmbH, Landsberger Str. 284, 80687 Munich [email protected]

Abstract. In a discrete choice experiment every respondent evaluates several choice sets with a defined number of alternatives, so that repeated observations are made for each respondent. When the respondents are heterogeneous, every respondent has her own preference structure in the panel mixed logit model. The idea of our approach is to generate each respondent's individual choice sets in a Bayesian framework based on her previous responses and on the conditional logit model. In this paper we fit a different conditional logit model to construct the individual designs and to allow for heterogeneity in the population. For constructing sensible prior distributions we use as source a well evaluated framework of consideration set, buying behavior and price knowledge questions. This prior information is used as a basis to generate an initial design with 4 choice sets for each respondent and as an input for the Bayesian design generation phase. A Bayesian analysis of the data from the initial stage is needed, because too little information is available at this stage to use maximum likelihood estimation. In the second stage of the adaptive design generation, the prior information for the design construction is updated after each choice task, and each following choice set is constructed using the newly updated prior. We applied the approach in a brand price trade-off type discrete choice experiment to study the buying behavior of dog owners in Germany. Half of the respondents were assigned to a classical DCM and the other half to the new adaptive DCM. Each respondent participating in the new approach saw 4 constructed choice tasks based on his previous answers and 11 adaptive choice sets generated with the Bayesian design generation. The results show that the heterogeneity could be captured much better with the new adaptive approach.
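
The conditional logit core of such designs can be sketched in a few lines; the attribute values and part-worths below are invented for illustration and are not taken from the study:

```python
import numpy as np

def choice_probs(X, beta):
    """Conditional logit probabilities for one choice set:
    P(j) = exp(x_j' beta) / sum_k exp(x_k' beta)."""
    u = X @ beta
    u = u - u.max()              # subtract max for numerical stability
    e = np.exp(u)
    return e / e.sum()

# Hypothetical choice set: rows = alternatives, columns = (brand dummy, price)
X = np.array([[1.0, 2.5],
              [0.0, 1.9],
              [1.0, 3.2]])
beta = np.array([0.8, -0.6])     # assumed part-worths

p = choice_probs(X, beta)
# Log-likelihood contribution of one observed choice (say, alternative 0);
# in an adaptive procedure such terms drive the Bayesian updating of beta.
ll = np.log(p[0])
```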

References

Jie, Y., Goos, P. and Vandebroek, M. (2011): Individually adapted sequential Bayesian conjoint-choice designs in the presence of consumer heterogeneity. International Journal of Research in Marketing, 28, 378-388. Sonnevend, G. (1985): An Analytic Center for Polyhedrons and New Classes of Global Algorithms for Linear (Smooth, Convex) Programming. Proceedings of the 12th IFIP Conference on System Modeling and Optimization, Budapest. Toubia, O., Hauser, J.R. and Garcia, R. (2007): Probabilistic polyhedral methods for adaptive choice based conjoint analysis: Theory and application. Marketing Science, 26(5), 596-610.

Keywords

Discrete Choice Models, Adaptive Design Algorithms, Analytic Center, Brand Price Trade-Off

Lead User Classification for Data Analysis in Marketing

Alexander Sänn and Daniel Baier

Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus-Senftenberg, Postbox 101344, 03013 Cottbus, Germany {alexander.saenn, daniel.baier}@b-tu.de

Abstract. It is crucial for businesses to generate innovations in order to establish a competitive market presence. The management of product portfolios and product lines is a highly demanding task and relies, for example, on preference data from ordinary and leading customers. The implementation of lead users in the new product development process is well known to generate successful breakthrough innovations and to foster new product lines as well as major functional improvements. The literature also indicates that lead users tend to possess extreme needs that ordinary customers might never have. This research joins the discussion on lead user classification, needs and characteristics to stimulate successful innovations. Former findings in the field of mountain biking and industrial IT-security solutions led to an adapted approach a) to integrate user innovations within the lead user screening approach, b) to perform preference measurement upon freely revealed user contributions, and c) to provide a better understanding of lead user contributions. The empirical setting is based on complex industrial goods and reveals valuable implications for application.

References

RESNICK, P., IACOVOU, N., SUCHAK, M., BERGSTROM, P. and RIEDL, J. (1994): GroupLens: An Open Architecture for Collaborative Filtering of Netnews. Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, 175–186. SÄNN, A., KRIMMLING, J., BAIER, D. and NI, M. (2013): Lead User Intelligence for Complex Product Development: The Case of Industrial IT-Security Solutions. International Journal of Technology Intelligence and Planning, 9, 232–249. VON HIPPEL, E. (1986): Lead Users: A Source of Novel Product Concepts. Management Science, 32, 791–805.

Keywords

Preference Measurement, Lead User, Collaborative Filtering, Innovation

23 CON-3E: Big Data Analytics

Thursday, July 3, 2014: 14:35 - 16:15, West Hall 6

Missing Data Methods for Big Data Analysis

Dieter William Joenssen

Ilmenau University of Technology, Helmholtzplatz 3, 98693 Ilmenau, Germany [email protected]

Abstract. The analysis of Big Data, a process characterized by the volume, variety, and velocity of the data, has received perpetual attention over the past years. This attention has prompted the development of methods to fully utilize the potential of Big Data and process models to do so consistently. However, many challenges within the analysis process remain largely unaddressed. Big Data are rarely gathered for the purpose of a single analysis, but rather are compiled from a variety of sources collected for different purposes. This inevitably leads to sampling and data quality issues, e.g., missing values, which negatively impact analysis results. While managing missing data is part of every data mining process model, e.g., CRISP-DM 1.0, literature discussing missing data methods in the context of Big Data is scant to date. This discussion requires the consideration of constraints imposed by the data's volume and velocity, which is wholly lacking in literature. To this end, computational complexity theory is used to evaluate a selection of missing data methods. Resultant algorithmic complexity measures are presented in conjunction with other algorithm properties. Conclusions reached allow an optimal selection of missing data methods in light of computational constraints.
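
To make the complexity argument concrete, here is a toy contrast between an O(n) and an O(n²) imputation method (mean imputation versus a deterministic nearest-neighbour hot deck); the abstract itself does not prescribe these particular methods, and the data are invented:

```python
import numpy as np

def mean_impute(x):
    """O(n): one pass for the mean, one pass to fill the gaps."""
    out = x.copy()
    out[np.isnan(out)] = np.nanmean(x)
    return out

def nn_hot_deck(X, col):
    """O(n^2) deterministic nearest-neighbour hot deck: each recipient
    searches all complete donors on the remaining columns -- quadratic
    cost quickly becomes prohibitive at Big Data volumes."""
    out = X.copy()
    miss = np.isnan(X[:, col])
    donors = X[~miss]
    other = [c for c in range(X.shape[1]) if c != col]
    for i in np.where(miss)[0]:
        d = ((donors[:, other] - X[i, other]) ** 2).sum(axis=1)
        out[i, col] = donors[d.argmin(), col]
    return out

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 4.0],
              [2.1, 5.0]])
filled = nn_hot_deck(X, 1)   # row 1 receives the value of its nearest donor
```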

References

CHAPMAN, P., CLINTON, J., KHABAZA, T., REINARTZ, T. and WIRTH, R. (1999): The CRISP-DM Process Model. CRISP-DM Consortium. KNUTH, D.E. (1997): The Art of Computer Programming. Addison-Wesley, Menlo Park. PEARSON, R.K. (2005): Mining Imperfect Data. Society for Industrial and Applied Mathematics, Philadelphia. LITTLE, R.J.A. and RUBIN, D.B. (2002): Statistical Analysis with Missing Data. Wiley, Hoboken.

Keywords

Missing Data Methods, Imputation, Big Data Analysis, Computational Complexity

Big Data Oriented Symbolic Data Analysis in Cloud

Hiroyuki MINAMI1 and Masahiro MIZUTA2

1 Information Initiative Center, Hokkaido University [email protected] 2 Information Initiative Center, Hokkaido University [email protected]

Abstract. The word “Big Data” has become a universal term, though it was once regarded as a buzzword. We statisticians have studied and utilized many approaches to Big Data, but we have to face another kind of practical problem beyond data analysis, namely: “How to manage Big Data from the viewpoint of data engineering?” For example, we sometimes fail to read a Big Data set into R directly due to its size and/or irregularity. In a computer cloud, Hadoop is known as a popular solution to handle massive data, and the “key-value” style is one of its primary concepts. However, it might be far from our conventional strategies and hard to utilize some powerful CRAN libraries on Hadoop. Now, it is time to make our ideas Big Data and cloud oriented. In this study, we discuss the difficulty in Big Data analysis and shape a novel solution in the cloud with Symbolic Data Analysis (SDA). We introduce, so to speak, a “Big Data oriented” SDA which makes use of a computer cloud in most processes, including handling the observations and making tables, and demonstrate its availability and effectiveness on real Big Data sets whose sizes are around hundreds of gigabytes, including raw-level Internet traffic data and radiation monitoring data collected in Japan since the Fukushima nuclear accident.

References

MINAMI, H. and MIZUTA, M. (2013): A Big Data Intensive Application System with Symbolic Data Analysis and its Implementation. Conference of the International Federation of Classification Societies IFCS-2013, 41. WHITE, T. (2011): Hadoop: The Definitive Guide (2nd edition). O'Reilly. NUCLEAR REGULATION AUTHORITY IN JAPAN: Monitoring information of environmental radioactivity level. http://radioactivity.nsr.go.jp/map/ja/download.html.

Keywords

HIGH-PERFORMANCE COMPUTING, MASSIVE DATASET, HADOOP

Big Data Analytics vs. Classical Data Science

Claus Weihs

Department of Statistics, TU Dortmund University [email protected]

Abstract. This paper starts with an overview of the potentials of Big Data Analytics. Then, it discusses the differences between Classical Data Science and Big Data Analytics. We will mainly consider two cases. In the case of many variables, we will compare dimension reduction and variable selection. In the case of many observations, we will discuss approximations based on data partitions and which (online) methods are typically used for data streaming. Links to current research lines will be given.

Keywords

Big Data Analytics, many variables, many observations

Epistemic Uncertainty Sampling for Active Learning on Data Streams

Ammar Shaker and Eyke Hüllermeier

Computer Science Department, University of Paderborn, Germany

Abstract. Methods for incremental learning from so-called data streams have received increasing attention in recent years. Due to the potentially large volume and high speed of streaming data, combined with possible costs for a supervision of the training process, active learning strategies for filtering out those training examples that appear to be most informative for the learner are specifically important in this context. A common strategy to estimate the degree of information comprised by an example is to quantify the learner’s uncertainty on that example—selecting those examples of highest uncertainty is also known as uncertainty sampling. Here, we elaborate on a novel sampling strategy that builds on a recent method for reliable classification (Senge et al., 2014), in which a distinction is made between two types of uncertainty: aleatoric uncertainty, which is due to statistical variability and effects that are inherently random, and epistemic uncertainty, which is caused by a lack of knowledge. Arguing that the latter is more relevant as a target for active learning, we develop an approach to active learning on data streams that focuses on the epistemic part of the total uncertainty. More specifically, we extend our previous work on instance-based classification on data streams (Shaker and Hüllermeier, 2012) by an active learning component. Apart from a description of the method, the paper also provides experimental results showing its effectiveness.
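
The distinction can be illustrated with a toy batch version of uncertainty sampling; the epistemic score below is a simple density-based proxy invented for illustration, not the plausibility-based measure of Senge et al. (2014):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))                  # stand-in stream buffer
p_hat = 1.0 / (1.0 + np.exp(-X[:, 0]))         # stand-in posterior P(y=1|x)

# Total (Shannon) uncertainty: the classical uncertainty-sampling score,
# which is high wherever p_hat is near 0.5 -- aleatoric or not.
eps = 1e-12
entropy = -(p_hat * np.log(p_hat + eps) + (1 - p_hat) * np.log(1 - p_hat + eps))

# Crude epistemic proxy: sparsity of the neighbourhood. Few observed
# points near x suggests a lack of knowledge rather than inherent noise.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
k = 10
knn_radius = np.sort(d2, axis=1)[:, k]         # squared distance to k-th neighbour
epistemic = knn_radius / knn_radius.max()

budget = 20                                    # labelling budget per batch
query_total = np.argsort(-entropy)[:budget]    # classical uncertainty sampling
query_epistemic = np.argsort(-epistemic)[:budget]
```

The two scores typically select different instances: entropy favors the dense region around the decision boundary, while the epistemic proxy favors sparsely observed regions.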

References

Senge, R., Bösner, S., Dembczynski, K., Haasenritter, J., Hirsch, O., Donner-Banzhoff, N. and Hüllermeier, E. (2014): Reliable Classification: Learning Classifiers that Distinguish Aleatoric and Epistemic Uncertainty. Information Sciences, 255, 16–29. Shaker, A. and Hüllermeier, E. (2012): IBLStreams: A System for Instance-based Classification and Regression on Data Streams. Evolving Systems, 3, 235–249.

Keywords

DATA STREAMS, ACTIVE LEARNING, CLASSIFICATION, UNCERTAINTY

24 CON-3F: Statistics and Data Analysis IV

Thursday, July 3, 2014: 14:35 - 16:15, West Hall 6

Specialization in Smart Growth Sectors vs. Effects of Workforce Number Changes in the European Union Regional Space

Elżbieta Sobczak1 and Marcin Pełka2

1 Wrocław University of Economics, Department of Regional Economics, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected] 2 Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]

Abstract. The purpose of the study is to identify the relations between the level of specialization in smart growth sectors and the effects of workforce number changes in the NUTS 2 regions of the European Union. Multivariate data analysis methods, the structural-geographic shift-share method and regional specialization indices were applied in the study. The structure of workforce in economic sectors, separated based on the intensity of research and development activities in the NUTS 2 regions in the period 2008-2012, constituted the subject of analysis. The application of shift-share analysis allowed for determining the structural, competitive and allocation effects of workforce number changes in smart growth sectors against the reference area. Multivariate data analysis methods facilitated the typology of the analyzed regions with respect to the level of specialization and the type of effects resulting from workforce number changes in smart growth sectors, as well as determining the relations between them.
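
The classical shift-share decomposition underlying such studies splits a region's employment change into a national (reference-area) effect, an industry-mix effect and a competitive effect; a sketch with invented three-sector numbers:

```python
import numpy as np

def shift_share(e0, e1, E0, E1):
    """Classical shift-share decomposition of regional employment change.
    e0, e1: regional employment by sector at start/end of the period;
    E0, E1: reference-area employment by sector (e.g., the EU total)."""
    gN = E1.sum() / E0.sum() - 1.0     # overall reference-area growth rate
    gi = E1 / E0 - 1.0                 # sectoral reference-area growth rates
    gri = e1 / e0 - 1.0                # regional sectoral growth rates
    national = e0 * gN                 # national (reference) share effect
    mix = e0 * (gi - gN)               # structural / industry-mix effect
    competitive = e0 * (gri - gi)      # regional competitive effect
    return national, mix, competitive

# Invented example: three sectors of low / medium / high R&D intensity
e0 = np.array([120.0, 40.0, 15.0])
e1 = np.array([115.0, 52.0, 21.0])
E0 = np.array([9000.0, 2500.0, 600.0])
E1 = np.array([8800.0, 3000.0, 780.0])
national, mix, competitive = shift_share(e0, e1, E0, E1)
```

By construction the three components sum exactly to the observed regional change e1 - e0 in each sector.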

References

BARFF, R.A. and KNIGHT III, P.L. (1988): Dynamic Shift-Share Analysis. Growth and Change, 19(2). HATZICHRONOGLOU, T. (1996): Revision of the High-Technology Sector and Product Classification. OECD, Paris. KRUGMAN, P.R. (1991): Increasing Returns and Economic Geography. Working Paper no. 3275, National Bureau of Economic Research. MIDELFART-KNARVIK, H.H. (2002): Delocation and European Integration. Economic Policy, no. 35.

Keywords

HIGH-TECH SECTORS, KNOWLEDGE-INTENSIVE SERVICES, REGIONAL SPECIALIZATION, EUROPEAN UNION REGIONS

Comparison of working conditions in European countries with respect to gender, age and education

Zerrin Asan Greenacre1 and Michael Greenacre2

1 Anadolu University, Eskisehir, Turkey [email protected] 2 Pompeu Fabra University, Barcelona, Spain [email protected]

Abstract. Our study aims to explain how working conditions differ across European countries. Data used are from the fifth wave of the European Working Conditions Survey (EWCS), which has been carried out by Gallup Europe and its network of national institutes. Face-to-face interviews were carried out with persons in employment in the 27 EU member states. The questionnaire covered several aspects of working conditions, including workplace design, working hours, work organization, social relationships and physical environment in the workplace. In this study, we are interested in comparing the responses of men and women within each country as well as between the countries. As a first analysis, individual-level data have been aggregated into country-gender groups and the response percentages computed. Correspondence analysis is used to interpret and quantify differences between countries and between genders. Once the “significant” dimensionality of the solution is determined, the scale values for country and gender can be used in a type of multivariate analysis of variance, called structured data analysis. In a subsequent analysis, we extend this approach to include other demographic variables such as age and education group.
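
The core computation of correspondence analysis can be sketched via an SVD of the standardized residuals of a contingency table; the small table below is invented for illustration, not EWCS data:

```python
import numpy as np

def correspondence_analysis(N):
    """CA of a contingency table via SVD of the standardized residuals."""
    P = N / N.sum()                                # correspondence matrix
    r = P.sum(axis=1)                              # row masses
    c = P.sum(axis=0)                              # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]    # principal row coordinates
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None] # principal column coordinates
    return row_coords, col_coords, sv ** 2         # sv**2 = principal inertias

# Invented country-group by response-category table
N = np.array([[45.0, 30.0, 25.0],
              [20.0, 40.0, 15.0],
              [10.0, 20.0, 35.0]])
row_coords, col_coords, inertia = correspondence_analysis(N)

# Sanity check: total inertia times n equals the Pearson chi-square statistic
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
chi2 = ((N - E) ** 2 / E).sum()
```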

References

GREENACRE, M. and BLASIUS, J. (2006): Multiple Correspondence Analysis and Related Methods. Chapman & Hall/CRC, London. GREENACRE, M. (2007): Correspondence Analysis in Practice, Second Edition. Chapman & Hall/CRC, London. LE ROUX, B. and ROUANET, H. (2004): Geometric Data Analysis, From Correspondence Analysis to Structured Data Analysis. Kluwer, Dordrecht. LE ROUX, B. and ROUANET, H. (2010): Multiple Correspondence Analysis. Sage Publications, USA.

Keywords

WORKING CONDITIONS, CORRESPONDENCE ANALYSIS, STRUC- TURED DATA ANALYSIS

The Identification of Relations Between Smart Growth and Sensitivity to Crisis in the European Union Regions – Panel Data Analysis

Beata Bal-Domańska

Wrocław University of Economics, Department of Regional Economics, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]

Abstract. The purpose of the article is an attempt to measure and assess the sensitivity to crisis of the European Union regional economies, taking into account their sector structure. The research results presented in the literature indicate that the differences in the sector structure of particular economies were the main reason for the diverse consequences of the crisis. The study covered the NUTS-2 level regions in the period 2004-2011. Econometric models for panel data with adequate estimation techniques are used for the assessment of the EU regions' sensitivity to the effects of the 2008 crisis. The application of panel data allows the analysis to include also the specific, non-measurable individual effects for particular regions and time periods, which seems a particularly useful tool for the description of regional economic growth during the crisis.
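
A minimal fixed-effects (within) estimator, the simplest panel device for absorbing such non-measurable individual effects, can be sketched as follows; the simulated "regions" and the true slope of 2.0 are invented for illustration:

```python
import numpy as np

def within_estimator(y, X, group):
    """Fixed-effects (within) estimator: demean y and X within each
    region, then run pooled OLS on the demeaned data. The demeaning
    absorbs the region-specific, unobserved individual effects."""
    yd = y.astype(float).copy()
    Xd = X.astype(float).copy()
    for g in np.unique(group):
        m = group == g
        yd[m] -= yd[m].mean()
        Xd[m] -= Xd[m].mean(axis=0)
    beta, *_ = np.linalg.lstsq(Xd, yd, rcond=None)
    return beta

rng = np.random.default_rng(5)
G, T = 10, 20                                   # 10 "regions", 20 periods each
group = np.repeat(np.arange(G), T)
alpha = rng.normal(scale=3.0, size=G)[group]    # unobserved region effects
X = rng.normal(size=(G * T, 1))                 # one observed regressor
y = alpha + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=G * T)
beta = within_estimator(y, X, group)            # recovers the slope near 2.0
```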

References

ARELLANO, M. (2003): Panel Data Econometrics. Oxford University Press, Oxford. GREENE, W.H. (2003): Econometric Analysis. Pearson Education International, New Jersey. GROOT, S.T.P., MÖHLMANN, J.L., GARRETSEN, J.H. and DE GROOT, H.L.F. (2011): The crisis sensitivity of European countries and regions: stylized facts and spatial heterogeneity. Cambridge Journal of Regions, Economy and Society, 4(3), 437–456.

Keywords

SENSITIVITY TO CRISIS, PANEL DATA ANALYSIS, REGIONAL ECONOMICS

A model for comparing government expenditure on civil servants' compensation of gross wages and salaries in EU24

J.C. Nwaubani and N. Kapoulas

University of Macedonia, Department of Applied Informatics, Thessaloniki-Greece

Abstract. The debt crisis that struck some EU members, and is far from having been surmounted, has prompted governments to embark on a policy of strict budgetary austerity. Recent developments in the European Union reflect the reformation needs of most EU governments, in the form of wage and employment moderation and, in some cases, even cuts in public sector wages, salaries and employment. It is in this respect that this study finds it worthwhile to address the following issue by using the most accurate association model of Categorical Data Analysis (CDAS) in comparing government expenditure on civil servants' compensation of gross wages and salaries in 24 countries of the EU. Eurostat data from 2002-2011 are used in this study. The analysis of association (ANOAS) table is given in order to ascertain the percentage of the data which is covered by each model. We estimate the association models to find the one with the best fit and conclude that the Row-Column Effects Association Model (RC) of the multivariate model (M = 8) has the best fit among all, since it covers more than 99.96% of the total data.

References

[1] Clogg, C.C. (1990): Analysis of Association (ANOAS) program. [2] Daianu, D. and Albu, L.-L. (1996): Strain and the Inflation-Unemployment Relationship: A Conceptual and Empirical Investigation. Econometric Inference into the Macroeconomic Dynamics of East European Economies, University of Leicester, UK Research Memorandum. [3] Diewert, W. Erwin (1995): Axiomatic and Economic Approaches to Elementary Price Indexes. [4] Eliason, S. and Clogg, C. (1990): Categorical Data Analysis (CDAS). NBER Working Paper 510. [5] Eurostat/JP (2010): European Data Agency, “Eurostat - population statistics at regional level”. [6] Fairbanks, Michael (2000): Changing the Mind of a Nation: Elements in a Process for Creating Prosperity. In: Culture Matters, Huntington (Ed.), New York: Basic Books, 270-281. [7] Goodman, L.A. (1979a): Multiple Models for the Analysis of Occupational Mobility Tables and Other Kinds of Cross-Classification Tables. American Journal of Sociology, 84, 804-819.


[8] Goodman, L.A. (1979b): Multiple Models for the Analysis of Occupational Mobility Tables and Other Kinds of Cross-Classification Tables. American Journal of Sociology, 84, 804-819. [9] Goodman, L.A. (1981a): Association Models and the Bivariate Normal for Contingency Tables with Ordered Categories. Biometrika, 68, 347-355. [10] Goodman, L.A. (1981b): Association Models and Canonical Correlation in the Analysis of Cross-Classifications Having Ordered Categories. Journal of the American Statistical Association, 20-34. [11] Haritou, A. and Nwaubani, J.C. (2009): Categorical Data Analysis (University Press). NBER Working Paper, University of Macedonia, Thessaloniki, Greece, Department of Applied Informatics. [12] Haritou, A. and Nwaubani, J.C. (2010): Categorical Data Analysis (University Press). N-working paper. [13] Haritou, A. and Nwaubani, J.C. (2011): Categorical Data Analysis (University Press). N-working paper.

Keywords

Association model, Log-linear and non-linear models, government expenditure, civil servants, EU27.

25 CON-4A: Machine Learning and Knowledge Discovery V

Friday, July 4, 2014: 08:30 - 10:35, West Hall 2

Multi-label classification using multivariate linear regression

Sarel Steel and Surette Bierman

Stellenbosch University, South Africa [email protected]

Abstract. Multi-label classification problems arise in scenarios where every data instance can be associated simultaneously with more than one of several available labels or categories. Application areas include music information retrieval (instrument recognition in polyphonic music), bioacoustics (identifying bird species from a chorus of sounds), and text and image annotation. Many algorithms have been proposed for dealing with multi-label classification problems. Problem transformation methods transform the multi-label problem into one or more binary or multi-class problems, while algorithm adaptation methods adapt a known binary or multi-class approach to deal with the multi-label nature of the problem. A good overview of aspects of multi-label classification is provided by Madjarov et al. (2012). In this paper we present an approach to multi-label classification based on (multivariate) linear regression of the matrix of label indicator values on the matrix of observations of the input variables. Direct application of multivariate linear regression to a multi-label data set, however, ignores the correlations that may exist amongst the label indicator variables. We therefore consider the curds and whey approach proposed by Breiman and Friedman (1997) to incorporate the information provided by these correlation coefficients. We discuss some modifications and illustrate the approach in terms of its performance on benchmark data sets.
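
The baseline step, multivariate linear regression of a 0/1 label-indicator matrix on the inputs, can be sketched as follows (simulated data; the curds-and-whey shrinkage refinement, which exploits the label correlations, is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, L = 120, 5, 3
X = rng.normal(size=(n, p))
W = rng.normal(size=(p, L))
# 0/1 label-indicator matrix driven by a noisy linear latent score
Y = ((X @ W + rng.normal(scale=0.5, size=(n, L))) > 0).astype(float)

X1 = np.hstack([np.ones((n, 1)), X])           # add an intercept column
B, *_ = np.linalg.lstsq(X1, Y, rcond=None)     # one coefficient column per label
scores = X1 @ B                                # fitted indicator values
pred = (scores > 0.5).astype(int)              # simple 0.5 cut-off per label
train_acc = (pred == Y).mean()                 # labelwise training accuracy
```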

References

BREIMAN, L. and FRIEDMAN, J.H. (1997): Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society B, 59, 3–54. MADJAROV, G., KOCEV, D., GJORGJEVIKJ, D. and DZEROSKI, S. (2012): An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45, 3084–3104.

Keywords

CURDS AND WHEY, MULTI-LABEL

Active Multi-Instance Multi-Label learning

Robert Retz1 and Friedhelm Schwenker1

Ulm University, Institute of Neural Information Processing, D-89069 Ulm {robert.retz | friedhelm.schwenker}@uni-ulm.de

Abstract. Multi-Instance Multi-Label learning (MIML), introduced by Zhou and Zhang in [1], is a comparatively new framework in machine learning with two special characteristics: firstly, each instance is represented by a set of feature vectors (a bag of instances), and secondly, bags of instances may belong to many classes (a multi-label). Thus, a MIML classifier receives a bag of instances and produces a multi-label. For classifier training, the training set is also of this MIML structure. Labeling a data set is always cost-intensive, especially in a MIML framework. In order to reduce the labeling costs it is important to restructure the annotation process in such a way that the most informative examples are labeled in the beginning, and less informative or non-informative data towards the end of the annotation phase. Partially supervised learning, especially active learning, is a possible approach to tackle these kinds of problems [2]. In this work we focus on the MIML-SVM algorithm [1] in combination with the k-medoids clustering algorithm to transform the multi-instance representation into a single-instance representation. For the clustering distance measure we consider variants of the Hausdorff distance, namely the median- and average-based Hausdorff distances. Finally, active learning strategies derived from the single-instance scenario have been investigated in the MIML setting and evaluated on several benchmark data sets.
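The two Hausdorff variants used as bag-to-bag distances above can be sketched as follows. This is a simplified illustration; exact variant definitions differ slightly across the literature, so treat these as one plausible reading rather than the authors' exact measures:

```python
import numpy as np

def _nn_dists(A, B):
    """Each point's distance to its nearest neighbour in the other bag,
    for bags A (a, d) and B (b, d) of feature vectors."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return D.min(axis=1), D.min(axis=0)

def average_hausdorff(A, B):
    # average of all nearest-neighbour distances, both directions
    ab, ba = _nn_dists(A, B)
    return (ab.sum() + ba.sum()) / (len(A) + len(B))

def median_hausdorff(A, B):
    # maximum of the two directed medians of nearest-neighbour distances
    ab, ba = _nn_dists(A, B)
    return max(np.median(ab), np.median(ba))
```

Either measure can be plugged into k-medoids as the dissimilarity between bags of instances.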

References

[1] Zhou, Z. and Zhang, M. (2007): Multi-instance multi-label learning with application to scene classification. In: Advances in Neural Information Processing Systems, 19, 1609–1616.
[2] Settles, B. (2009): Active learning literature survey. Technical report, Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI.

Keywords

MULTI-INSTANCE MULTI-LABEL, ACTIVE LEARNING

Generation of Datasets for Label Ranking

Massimo Gurrieri1, Philippe Fortemps1, Xavier Siebert1, Marc Pirlot1, Nabil Ait Taleb1, and Yves De Smet2

1 University of Mons, Faculté Polytechnique, Rue du Houdain 9, 5002 Mons, Belgium 2 Service de Mathématiques de la Gestion, Université Libre de Bruxelles, Boulevard du Triomphe, 1050 Brussels, Belgium

Abstract. Existing data sets for label ranking have been derived from machine learning data sets; they are essentially multi-class and regression data sets that were turned into label ranking data. For classification data, the procedure consists in training a naive Bayes classifier on the complete data set. For each training instance, all labels present in the data set are then ordered w.r.t. the predicted class probabilities. For regression data, a certain number of (numerical) attributes are removed from the set of predictors and are accordingly considered as labels. Finally, to obtain a ranking for each instance, the (removed) attributes are standardized and sorted in decreasing order of their values. However, in view of the lack of benchmark data for label ranking, we are currently investigating methods for generating artificial data sets that are more suitable for label ranking and that, furthermore, contain correlations between labels. In this work, we present methods based on multi-criteria decision making and on Bayesian networks.
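The regression-to-ranking construction described above can be sketched as follows (an illustrative reading of the procedure; the function name and interface are ours, and the chosen label columns are assumed non-constant):

```python
import numpy as np

def regression_to_ranking(data, label_cols):
    """Turn a regression data set into label-ranking data: remove the
    chosen attributes from the predictors, standardize them, and rank
    them per instance by decreasing standardized value."""
    X = np.delete(data, label_cols, axis=1)      # remaining predictors
    L = data[:, label_cols].astype(float)        # attributes turned into labels
    Z = (L - L.mean(axis=0)) / L.std(axis=0)     # standardize each label column
    rankings = np.argsort(-Z, axis=1)            # label indices, best first
    return X, rankings
```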

References

Hüllermeier, E., Fürnkranz, J., Cheng, W. and Brinker, K.: Label Ranking by learning pairwise preferences. Artif. Intell. 172 (16-17), 1897–1916, (2008).
Cheng, W., Hühn, J., Hüllermeier, E.: Decision Tree and Instance-Based Learning for Label Ranking. Proc. ICML-09, International Conference on Machine Learning, Montreal, Canada, (2009).
Gurrieri, M., Fortemps, P., Siebert, X., Greco, S., Matarazzo, B., Słowiński, R.: Label Ranking: A New Rule-Based Label Ranking Method. Advances on Computational Intelligence, Communications in Computer and Information Science, Volume 297, 613–623, (2012).

Keywords

Preference Learning, Machine Learning, Label Ranking.

Subset Correction for Multi-Label Classification

Robin Senge and Eyke Hüllermeier

Computer Science Department, University of Paderborn, Germany

Abstract. In contrast to conventional multi-class classification, where an instance or object belongs to exactly one class y ∈ Y = {y1, ..., ym}, multi-label classification (MLC) allows an instance to belong to several classes simultaneously (Tsoumakas and Katakis, 2007). Formally, the output space Y is thus replaced by the power set 𝒴 = 2^Y. Correspondingly, methods for MLC are supposed to produce predictions in the form of subsets Y ∈ 𝒴. In practice, the number of label combinations that are actually observed in MLC data sets is often only a tiny fraction of the total number |𝒴| = 2^m of theoretically possible subsets. Moreover, if a label combination y has never been observed in the training data, one may suspect that it does not exist at all (and should hence not be predicted). Although the validity of this argument strongly hinges on the total number of observations seen so far (and can never be proved with certainty), it motivates our idea of subset correction, which restricts a learner to the prediction of label combinations whose existence is testified by the (training) data. More specifically, we propose two approaches to subset correction, one based on probabilistic conditioning (with probabilistic classifier chains (Dembczynski et al., 2010) as classifiers) and the other one using distance-based approximation (replacing an unconstrained prediction y produced by the original learner with a “most similar” testified subset y∗). We analyze these approaches both theoretically and empirically. The main question we seek to answer is the following: under what conditions and for which MLC loss functions is subset correction able to improve the (average) predictive accuracy?
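The distance-based variant of subset correction can be sketched in a few lines. The abstract does not fix a particular distance, so the Hamming distance below is our illustrative assumption:

```python
import numpy as np

def subset_correct(y_pred, observed):
    """Replace an unconstrained 0/1 label-vector prediction with the
    closest label combination observed in training (Hamming distance)."""
    dists = (observed != y_pred).sum(axis=1)   # Hamming distance to each row
    return observed[np.argmin(dists)]
```

Here `observed` is the (k, m) array of distinct label combinations seen in the training data, so the corrected prediction is always a testified subset.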

References

Dembczynski, K., Cheng, W., and Hüllermeier, E. (2010): Bayes Optimal Multilabel Classification via Probabilistic Classifier Chains. Proc. ICML-2010, International Conference on Machine Learning, Haifa, Israel.
Tsoumakas, G., and Katakis, I. (2007): Multi-label classification: An overview. Int. Journal of Data Warehousing and Mining, 3(3):1–13.

Supervised Classification of Viral Genomes based on Restriction Site Distribution

Mohamed Amine Remita, Ahmed Halioui and Abdoulaye Baniré Diallo

Department of Computer Science, Université du Québec à Montréal, P.O. Box 8888 Downtown Station, Montreal, Quebec, H3C 3P8, Canada. [email protected]

Abstract. Over the last decade, advances in sequencing technologies have led to better knowledge of the genomic and taxonomic characteristics of viruses. Due to the volume of newly sequenced genomes in metagenomic and viral multi-infection data, it is important to provide efficient methods to genotype and classify the involved viruses (identify the type, class, species and/or genus). Molecular biology techniques such as the one based on Restriction Fragment Length Polymorphism (RFLP) [1] are powerful, but too limited and expensive to be applied to thousands of genomes. Here, we modelled the RFLP technique to fit a computational framework. We propose an original genotyping approach that applies supervised machine learning methods to restriction site fragment distributions. To this end, we combined a set of 516 different types of attributes on the restriction site distributions for 3 viral datasets (Papilloma viruses (PV), Hepatitis B viruses (HBV) and Human Immunodeficiency viruses (HIV)), containing more than 3000 whole genomes. We assessed the approach with 7 kinds of supervised classifiers, such as decision tree based algorithms, SVM and KNN, using 10-fold cross-validation. The classification performance on divergent viral sequences (inter-viral) and conserved viral sequences (inter-genus and inter-species) highlights correct predictions of 96% for inter-species, and 99% for inter-genus as well as inter-viral classifications in PV genomes. Similar trends have been found for HBV and HIV. With high prediction rates and robustness, as well as rapidity, such an approach will be essential in all large-scale viral studies.
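The in-silico digestion underlying such a computational RFLP framework can be sketched as follows. This is a deliberate simplification that cuts at the start of each recognition site and ignores enzyme-specific cut offsets within the site; the function name is ours:

```python
def digest(genome, site):
    """In-silico restriction digest: return the fragment lengths obtained
    by cutting the genome at every occurrence of the recognition site."""
    cuts, i = [0], genome.find(site)
    while i != -1:
        cuts.append(i)
        i = genome.find(site, i + 1)
    cuts.append(len(genome))
    # drop zero-length fragments (e.g. a site at position 0)
    return [b - a for a, b in zip(cuts, cuts[1:]) if b > a]
```

Fragment-length distributions obtained this way for a panel of enzymes give the kind of attributes a supervised classifier can be trained on.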

References

Saiki, R.K., Scharf, S., Faloona, F., Mullis, K.B., Erlich, H.A., Arnheim, N. (1985): Enzymatic amplification of beta-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science, 230 (4732), 1350–1354.

Keywords

CLASSIFICATION, VIRUS GENOMES, GENOTYPING, SUPER- VISED LEARNING, KNOWLEDGE DISCOVERY

26 CON-4B: Visualization and Graph Models

Friday, July 4, 2014: 08:30 - 10:35, West Hall 3

Uncertainty in Medical Data Analysis in the Case of Carotid Vessel Visualization

Gordan Ristovski1, Tobias Preusser1,2, Horst Hahn1,2, and Lars Linsen1

1 Jacobs University, Bremen, Germany 2 Fraunhofer MEVIS, Bremen, Germany

Abstract. The medical visualization pipeline ranges from medical imaging processes over several data analysis steps to the final rendering output. Each of these steps induces a certain amount of uncertainty based on errors or assumptions. The rendered images typically omit this information and allude to the fact that the shown information is the only possible truth. Medical doctors may base their diagnoses and treatments on these visual representations, which can lead to unnecessary surgical interventions [1]. However, many decisions made in the visualization pipeline are sensitive to small changes. A compelling example is that of vessel visualization for detection and diagnosis of vessel abnormalities such as stenoses (an abnormal narrowing of the vessels). To allow for a proper assessment of the data by the medical experts, the uncertainty that is inherent to the displayed information needs to be revealed. This is the task of uncertainty visualization. Recently, many approaches have been presented to tackle uncertainty visualization, including a few techniques in the context of medical visualization, but they typically address one specific problem. In order to comprehensively understand what types of uncertainty exist in medical visualization and what their characteristics in terms of mathematical models are, we have built a taxonomy of uncertainty types by categorizing the types in an abstract form and describing them mathematically in a rigorous way [2]. In this talk, we identify uncertainties related to the pipeline of carotid vessel visualization and relate the uncertainty types to our taxonomy. We discuss the visualization challenges and investigate the effectiveness of the existing visualization techniques that are applicable to the carotid setting for each relevant type. We present novel alternative visualization approaches that we have developed and compare them to state-of-the-art techniques.

References

[1] C. Lundström, P. Ljung, A. Persson and A. Ynnerman (2007): Uncertainty Visualization in Medical Volume Rendering Using Probabilistic Animation. IEEE Transactions on Visualization and Computer Graphics, volume 13, issue 4, 1648–1655.
[2] G. Ristovski, T. Preusser, H. K. Hahn and L. Linsen (2014): Uncertainty in medical visualization: Towards a taxonomy. Computers & Graphics, volume 39, 60–73.

Keywords

Taxonomies, Medical Data Analysis, Uncertainty Visualization, Biomedical and Medical Visualization

Visual Analysis of Multi-run Spatio-temporal Simulation Data

Alexey Fofonov1 and Lars Linsen2

1 Jacobs University, Bremen, Germany [email protected] 2 Jacobs University, Bremen, Germany [email protected]

Abstract. Multi-run simulations are widely used to investigate how simulated processes evolve depending on varying initial conditions. Frequently, such simulations model the change of 2D or 3D spatial phenomena over time. Visual representation and analysis of such multi-run spatio-temporal data require fast and robust computational algorithms as well as novel visualization approaches. Isocontours are commonly used visual representations for 2D or 3D spatial data visualization and have proven to be effective for the analysis of scalar fields. We propose a novel visualization approach for multi-run simulation data based on isocontours. The approach is sensitive to the shapes of isocontours; it is therefore possible to effectively capture all information about the simulations instead of just using means or other statistical summaries. By introducing a distance function for isocontours, we generate a distance matrix used for a multidimensional scaling projection. Multiple simulation runs are represented by polylines in the projected view, displaying change over time. We propose a fast calculation of isocontour differences based on a quasi-Monte Carlo approach, which can be run in parallel for efficient processing. For interactive visual analysis, we provide filtering and selection mechanisms on the multi-run plot as well as linked views to physical space visualizations. Our approach can be effectively used for the visual representation of ensembles, for pattern and outlier detection within the multi-run data set, for the investigation of the influence of simulation parameters, and for a detailed analysis of detected features. The proposed method is applicable to data of any spatial dimensionality and any spatial representation (gridded or unstructured). To show the main benefits of the proposed technique we consider generated synthetic data and discuss all stages of their analysis in detail.
We validate our approach by applying it to different types of multi-run spatio-temporal data stemming from climate modeling and astrophysical simulations, with further feedback from field specialists to estimate the efficiency of the proposed methods.
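The projection step applied to the isocontour distance matrix can be sketched with classical (Torgerson) MDS. This is an illustration only; the abstract does not state which MDS variant the authors use:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed n objects in k dimensions so
    that Euclidean distances approximate the given distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:k]                # largest eigenvalues first
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))
```

Applying this to one distance matrix per time step and connecting each run's projected positions yields the polylines of the multi-run plot.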

Keywords

Coordinated and Multiple Views, Extraction of Isosurfaces, Time- varying Data, Integrating Spatial and Non-Spatial Data Visualization

Size and shape effects in biplots

Michael Greenacre1

Universitat Pompeu Fabra, Barcelona, Catalonia, Spain [email protected]

Abstract. Biplots can be thought of as a multidimensional scaling of a distance matrix between samples followed by a set of linear regressions of the variables on the dimensions of the solution to obtain biplot axes. In principal component analysis (PCA) biplots, which rely on Euclidean distances between samples, the first dimension is often called a “size” dimension when all the variables have coordinates with the same sign on the dimension. Subsequent dimensions are generally “shape” dimensions, but include a certain component of size as well, since the size effect may not be totally concentrated into the first dimension. In correspondence analysis (CA) biplots, which rely on the chi-square distance, the size effect has been partialled out from the outset because relative values are visualized; hence all dimensions are dimensions of shape. In practice, several other distance and dissimilarity measures may be used and it is not immediately clear whether the dimensions extracted pertain to effects of size or shape. For example, the Bray-Curtis dissimilarity is ubiquitous in ecological research, but to what extent does this dissimilarity measure include size and shape effects? In this talk I investigate these effects in the context of several distance and dissimilarity measures used in practice and try to come up with a way to measure how much biplots and their dimensions are visualizing size, shape or a mixture of both.
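The notion of a “size” dimension, i.e. all variables loading with the same sign on the first principal axis, can be checked mechanically. The following is an illustrative sketch, not the author's diagnostic:

```python
import numpy as np

def first_pc_is_size(X):
    """True if the first principal axis of the centered data is a 'size'
    dimension, i.e. all variable loadings share the same sign."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pc1 = Vt[0]                      # loadings on the first principal axis
    return bool(np.all(pc1 > 0) or np.all(pc1 < 0))
```

Positively correlated variables typically produce a size dimension, whereas a mix of positive and negative correlations does not.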

References

GREENACRE, M. (2010): Biplots in Practice. BBVA Foundation, Madrid. Free download from www.multivariatestatistics.org.
GREENACRE, M. and PRIMICERIO, R. (2013): Multivariate Analysis of Ecological Data. BBVA Foundation, Madrid. Free download from www.multivariatestatistics.org.

Keywords

BIPLOT, PRINCIPAL COMPONENT ANALYSIS, CORRESPON- DENCE ANALYSIS, DISTANCE, SIZE, SHAPE

Reviewing Graphical Modelling of Multivariate Temporal Processes

Matthias Eckardt1

Institute of Computer Science, Humboldt-Universität zu Berlin [email protected]

Abstract. By combining probability theory and graph theory, graphical models provide a suitable approach to dealing with uncertainty and complexity using conditional independence statements and factorisations of joint densities. Static undirected as well as directed graphical models have been applied frequently to pattern analysis, decision modelling, machine learning and image filtering. Several temporal extensions have been published, including dynamic Bayesian networks and temporal Markov random fields. Although graphical models are most commonly used within computer science, there has been a growing interest in graphical modelling in adjacent disciplines. Recently, a few temporal extensions have been applied to multivariate time series data using graphs built on undirected as well as directed edges. Our talk will review these mixed graphical models, which are especially suitable for highly complex temporal data.

References

EICHLER, M. (2012): Graphical Modelling of Multivariate Time Series. Probability Theory and Related Fields, 153, 233–268.
KOLLER, D. and FRIEDMAN, N. (2010): Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, Massachusetts.
COWELL, R.G., DAWID, A.P., LAURITZEN, S.L. and SPIEGELHALTER, D.J. (2010): Probabilistic networks and expert systems. Springer, London.

Keywords

GRAPHICAL MODELS, MULTIVARIATE TIME SERIES, HIGHLY STRUCTURED PROCESSES

Learning Hierarchical Document Classifications from Recommender Graphs: An Application of Modularity Clustering

Fabian Ball1 and Andreas Geyer-Schulz2

1 Karlsruhe Institute of Technology [email protected] 2 Karlsruhe Institute of Technology [email protected]

Abstract. In this article we apply ensemble-based randomized algorithms to several recom- mender graphs from libraries for scientific institutions. We study the possibility to learn hier- archical document classifications from the recommender graphs of these libraries. The basic idea of the approach is to identify permutation invariant subsets of documents at several levels of aggregation.
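Modularity, the quantity maximized by the clustering algorithms cited in the references, can be computed for a candidate partition as follows. This is a plain sketch of Newman's modularity Q, not the randomized ensemble algorithm itself:

```python
import numpy as np

def modularity(A, labels):
    """Newman's modularity Q of a partition, for an undirected graph
    given by a symmetric adjacency matrix A and per-node cluster labels."""
    m = A.sum() / 2.0                  # total edge weight
    k = A.sum(axis=1)                  # node degrees
    Q = 0.0
    for c in set(labels):
        idx = [i for i, l in enumerate(labels) if l == c]
        e_in = A[np.ix_(idx, idx)].sum() / (2.0 * m)   # edge-end fraction inside c
        a_c = k[idx].sum() / (2.0 * m)                 # degree fraction in c
        Q += e_in - a_c ** 2
    return Q
```

For two disconnected triangles split by component, Q reaches the well-known value 0.5.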

References

GEYER-SCHULZ, A. and OVELGÖNNE, M. (2012): The Randomized Greedy (RG) Modularity Clustering Algorithm and the Core Groups Clustering (CGGC) Scheme. In: W. Gaul, A. Geyer-Schulz, J. Kunze (Eds.): Proc. German/Japanese Workshops Karlsruhe 2010/Kyoto 2012. Springer, Heidelberg.
OVELGÖNNE, M. and GEYER-SCHULZ, A. (2012): An Ensemble Learning Strategy for Graph Clustering. In: D. A. Bader, H. Meyerhenke, P. Sanders, D. Wagner (Eds.): 10th DIMACS Implementation Challenge – Graph Partitioning and Graph Clustering. DIMACS, Rutgers University, Piscataway.
OVELGÖNNE, M. and GEYER-SCHULZ, A. (2010): Cluster Cores and Modularity Maximization. In: W. Fan, W. Hsu, G. I. Webb, B. Liu, C. Zhang, D. Gunopulos, X. Wu (Eds.): Proceedings of the 10th IEEE International Conference on Data Mining Workshops (ICDMW-IEEE 10) in Sydney, Australia. IEEE Computer Society, Los Alamitos, 1204–1213.

Keywords

MODULARITY CLUSTERING, RECOMMENDER GRAPHS, DOC- UMENT CLASSIFICATION

27 CON-4C: Statistics and Data Analysis V

Friday, July 4, 2014: 08:30 - 10:35, West Hall 4

A comparison study for spectral, ensemble and spectral-mean shift clustering approaches for interval-valued symbolic data

Marcin Pełka1

1 Wrocław University of Economics, Faculty of Economics, Management and Tourism, Department of Econometrics and Computer Science. [email protected]

Abstract. Interval-valued data arise in practical situations such as recording monthly interval temperatures at meteorological stations, daily interval stock prices, etc. This paper presents a comparison study of clustering efficiency (according to the adjusted Rand index) for spectral, ensemble and spectral-mean shift clustering methods for symbolic data. Evaluation studies applying artificial data with known cluster structure (obtained from the mlbench and clusterSim packages of R) show the usefulness and stable results of ensemble clustering compared to the spectral and spectral-mean shift methods.
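The adjusted Rand index used as the efficiency criterion above can be sketched in its standard textbook form (the function name is ours):

```python
from math import comb
from collections import Counter

def adjusted_rand(a, b):
    """Adjusted Rand index between two partitions given as label lists:
    agreement on pairs of objects, corrected for chance."""
    n = len(a)
    sum_ij = sum(comb(v, 2) for v in Counter(zip(a, b)).values())
    sum_a = sum(comb(v, 2) for v in Counter(a).values())
    sum_b = sum(comb(v, 2) for v in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2.0
    return (sum_ij - expected) / (max_index - expected)
```

The index equals 1 for identical partitions (up to relabeling) and is close to 0 for independent ones.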

References

BOCK, H.-H., DIDAY, E. (Eds.) (2000): Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data. Springer Verlag, Berlin-Heidelberg.
FRED, A.L.N., JAIN, A.K. (2005): Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, 835–850.
NG, A., JORDAN, M., WEISS, Y. (2002): On spectral clustering: analysis and an algorithm. In: T. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14. MIT Press, 849–856.
von LUXBURG, U. (2006): A tutorial on spectral clustering. Max Planck Institute for Biological Cybernetics, Technical Report TR-149.
WALESIAK, M., DUDEK, A. (2002): The clusterSim package for R software. [URL:] www.r-project.org.

Keywords

SYMBOLIC DATA ANALYSIS, ENSEMBLE CLUSTERING, SPEC- TRAL CLUSTERING

Clustering and solar radiance prediction

Henri Ralambondrainy1, Yves Lechevallier2, Jean-Daniel Lan-Sun-Luk3 and J.P. Chabriat3

1 LIM, Université de la Réunion, 97490 Sainte-Clotilde, Réunion [email protected] 2 INRIA, EPI AxIS, Paris-Rocquencourt, 78153 Le Chesnay cedex, France [email protected] 3 LE2P, Université de la Réunion, 97490 Sainte-Clotilde, Réunion {lanson,doyensc}@univ-reunion.fr

Abstract. The development of grid-connected photovoltaic power systems is encouraged on Reunion Island, where solar radiation is strong but inherently intermittent. We present the results of an interdisciplinary research project addressing this problem, which involves researchers in energy, meteorology and data mining. The data, collected from December 2008 to March 2012, concern solar radiation hour by hour. Prior to prediction modelling, two clustering strategies have been applied and compared for the analysis of the database of 956 days. The first approach combines proven data-mining methods, and the second approach is a clustering method that operates on a set of dissimilarity matrices. For the hourly solar radiance prediction, we propose a regression method that looks for local linear models related to the classes of a partition. Results are compared with Breiman's regression-tree method.
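The idea of fitting local linear models per cluster can be sketched as follows. This is an illustration assuming ordinary least squares within each class; the names and interface are ours:

```python
import numpy as np

def classwise_linear_fit(X, y, labels):
    """Fit one ordinary-least-squares linear model per cluster label.
    Returns {label: coefficients [intercept, slopes...]}."""
    labels = np.asarray(labels)
    models = {}
    for c in np.unique(labels):
        idx = labels == c
        A = np.column_stack([np.ones(idx.sum()), X[idx]])  # design matrix
        coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
        models[c] = coef
    return models
```

Prediction for a new day would first assign it to a class and then apply that class's linear model.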

References

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984): Classification And Regression Trees. New York: Chapman and Hall.
D'Urso, P. and Vichi, M. (1998): Dissimilarities between trajectories of a three-way longitudinal data set. In: A. Rizzi, M. Vichi and H.-H. Bock (Eds.), Advances in data science and classification, pp. 585–592. Berlin: Springer.
Lê, S., Josse, J. and Husson, F. (2008): FactoMineR: An R package for multivariate analysis. Journal of Statistical Software, 25, 1–18.

Keywords

Clustering, Prediction, Regression

Moving Functional MDS and its Application to Monitoring Post Data in Fukushima Prefecture

Masahiro MIZUTA1 and Hiroyuki MINAMI2

1 Information Initiative Center, Hokkaido University [email protected] 2 Information Initiative Center, Hokkaido University [email protected]

Abstract. Functional Data Analysis (FDA) is an effective approach to dealing with complex and huge data; the objects are represented by functions. In most methods of FDA, the domain of the functions is predefined and fixed. However, the results of FDA sometimes depend on the domain, especially when we use dissimilarities between objects. Mizuta (2004) proposed moving functional clustering, which adopts a moving domain or window. In this paper, we propose a novel method for functional data, moving functional multidimensional scaling. We also show the results of the proposed method on an important data set: monitoring information on environmental radioactivity levels in Fukushima prefecture (http://radioactivity.nsr.go.jp/map/ja/download.html).
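The moving-window construction of dissimilarities between functional objects can be sketched as follows. This is only an illustrative reading, assuming one L2 dissimilarity matrix per sliding-window position over functions sampled on a common grid; the interface is ours:

```python
import numpy as np

def moving_dissimilarities(curves, width, step=1):
    """One L2 dissimilarity matrix per sliding-window position, for
    functions sampled on a common grid (rows of `curves` = objects)."""
    n, T = curves.shape
    mats = []
    for start in range(0, T - width + 1, step):
        W = curves[:, start:start + width]          # restrict to the window
        D = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2)
        mats.append(D)
    return mats
```

Each window's dissimilarity matrix can then be fed to an MDS projection, making the dependence of the configuration on the chosen domain explicit.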

References

Mizuta, M.(2004): Clustering methods for functional data. Proceedings in Computational Statistics, 1503–1510, Springer. Mizuta, M. and Kato, J. (2007): Functional Data Analysis and its Application. Lecture Notes in Artificial Intelligence 4481, 228–235, Springer. Ramsay, J. O. and Silverman, B. W. (2005): Functional Data Analysis (2nd ed.). Springer.

Keywords

FDA, SDA, RADIATION DOSE

Using annotated suffix tree similarity measure for text summarisation

Maxim Yakovlev1 and Ekaterina Chernyak2

1 NRU HSE [email protected] 2 NRU HSE [email protected]

Abstract. This work describes an attempt to improve the TextRank [1] algorithm. TextRank is an algorithm for unsupervised text summarisation. It has two main stages: the first stage represents a text as a weighted directed graph, where nodes stand for single sentences, and edges connect consecutive sentences and are weighted with between-sentence similarities. The second stage applies the PageRank algorithm [2] as it is to the graph. The nodes that get the highest ranks form the summary of the text. We focus on the first stage, especially on measuring the similarities between sentences. In [1] it is suggested to employ the common schema: use the vector space model (VSM), so that every text is a vector in a space of words or stems, and compute the cosine similarity between those vectors. Our idea is to replace this schema by using the annotated suffix tree (AST) [3] model for sentence representation. The AST overcomes several limitations of the VSM, such as being dependent on the size of the vocabulary and the length of sentences and demanding stemming or lemmatisation. This is achieved by taking all fuzzy matches between sentences into account and computing the probabilities of match occurrences. To make our results comparable to other text summarisation works, we experiment with the standard DUC collection. For testing the method on Russian texts we built our own collection based on newspaper articles with some sentences highlighted as being more important.
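The role the AST plays, scoring fuzzy substring matches between sentences, can be imitated by a much cruder stand-in. The following is a toy illustration of substring-based similarity only, not the AST model itself:

```python
def fuzzy_match_score(s, t, max_len=4):
    """Toy substring-based similarity: the fraction of substrings of s
    (up to max_len characters) that also occur in t."""
    subs = [s[i:i + l] for l in range(1, max_len + 1)
            for i in range(len(s) - l + 1)]
    return sum(1.0 for u in subs if u in t) / len(subs) if subs else 0.0
```

Unlike cosine similarity over a VSM, such a score needs no vocabulary and no stemming, which is the property the AST model exploits.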

References

1. MIHALCEA, R. and TARAU, P. (2004): TextRank: Bringing Order into Text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 404–411.
2. BRIN, S. and PAGE, L. (1998): The anatomy of a large-scale hypertextual Web search engine. In: Proceedings of the Seventh International Conference on World Wide Web, 107–117.
3. PAMPAPATHI, R., MIRKIN, B. and LEVENE, M. (2008): A suffix tree approach to anti-spam email filtering. Machine Learning, 65(1), 309–338.

Keywords

TEXT SUMMARISATION, SIMILARITY MEASURES, SUFFIX TREES

Using Hidden Markov Models to Improve Analyzing Accelerometer Data

Norman Wirsik1, Vitali Witowski1, Ronja Foraita1, Yannis Pitsiladis2, and Iris Pigeot1

1 Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen, Germany [email protected] 2 University of Brighton, Eastbourne, UK

Abstract. The use of accelerometers to objectively measure physical activity (PA) has become the most preferred method in recent years. Traditionally, cutpoints are used to assign impulse counts recorded by the devices to sedentary and activity ranges. Here, hidden Markov models (HMMs) are used in order to improve on the cutpoint method and achieve a more accurate identification of the sequence of modes of PA. 1,000 days of labeled accelerometer data have been simulated. For the simulated data the actual sedentary behavior and activity range of each count is known. The cutpoint method is compared with HMMs based on the Poisson distribution (HMM[Pois]), the generalized Poisson distribution (HMM[GenPois]) and the Gaussian distribution (HMM[Gauss]) with regard to misclassification rate (MCR), bout detection, detection of the number of activities performed during the day, and runtime. Using simulated data, HMM-based methods were superior with respect to activity classification when compared to the traditional cutpoint method and seem to be appropriate for modeling accelerometer data. HMM[Gauss] appears to be the most appropriate choice of all HMM-based methods for modeling real-life accelerometer data.
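Decoding the most likely activity sequence from a fitted Poisson HMM, as in the comparison above, can be sketched with the Viterbi algorithm. The parameters below are hand-picked for illustration; in the study the models are estimated from data:

```python
import numpy as np
from math import lgamma, log

def viterbi_poisson(counts, lambdas, trans, init):
    """Most likely hidden state sequence for a Poisson HMM (Viterbi).
    counts: observed impulse counts; lambdas: per-state Poisson means;
    trans: (K, K) transition matrix; init: initial state probabilities."""
    K, T = len(lambdas), len(counts)
    def logpmf(k, lam):                # Poisson log-probability
        return k * log(lam) - lam - lgamma(k + 1)
    delta = np.array([log(init[j]) + logpmf(counts[0], lambdas[j])
                      for j in range(K)])
    back = np.zeros((T, K), dtype=int)
    log_trans = np.log(np.asarray(trans))
    for t in range(1, T):
        new = np.empty(K)
        for j in range(K):
            s = delta + log_trans[:, j]        # score via each predecessor
            back[t, j] = int(np.argmax(s))
            new[j] = s[back[t, j]] + logpmf(counts[t], lambdas[j])
        delta = new
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):              # trace back the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With one low-rate "sedentary" state and one high-rate "active" state, the decoded sequence segments the day into activity modes, which is exactly what the cutpoint method approximates with a fixed threshold.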

References

POBER, D. M., STAUDENMAYER, J., RAPHAEL, C. and FREEDSON, P. S. (2006): Development of novel techniques to classify physical activity mode using accelerometers. Med Sci Sports Exerc, 38(9), 1626–1634.

Keywords

BOUT DETECTION, PATTERN RECOGNITION, PHYSICAL AC- TIVITY PATTERNS, TIME SERIES

28 CON-4D: Data Analysis in Social Sciences III

Friday, July 4, 2014: 08:30 - 10:35, West Hall 5

Making Sense of Qualitative Data: An Application of the Gioia Method

Fabiola H. Gerpott1 and Sven C. Voelpel2

1 Jacobs University Bremen, VU Amsterdam [email protected] 2 Jacobs University Bremen [email protected]

Abstract. In business research, qualitative research has often been accused of lacking objectivity, resulting in creative theorizing based on questionable data (Gioia et al., 2013). Nevertheless, qualitative methods possess a unique potential to provide distinctive insights regarding elusive organizational processes and phenomena. Yet, the richness of data collected in qualitative designs can be difficult to structure in a purposeful way so as to develop original contributions and achieve scientific and practical value (Corley and Gioia, 2011). Offering a solution to these challenges, Gioia and colleagues (cf. Gioia et al., 2013) developed a systematic approach providing scholars with a structured process for executing data collection and analysis. Based on the grounded theory strategy (Glaser and Strauss, 1967), data are gathered in the field and evaluated by the researcher in a recursive process. Finally, information is structured into 1st-order concepts representing the voices of the informants, 2nd-order themes classifying the 1st-order concepts into broader topics, as well as aggregated dimensions. We apply this methodology to investigate intergenerational learning in a training program of a large automobile manufacturer. Our results affirm the utility of the Gioia approach for connecting field data with theoretical concepts. We discuss limitations and future avenues for the pursuance of structured approaches in qualitative research.

References

CORLEY, K. G. and GIOIA, D. A. (2011): Building theory about theory building: What constitutes a theoretical contribution? Academy of Management Review, 36, 12–32.
GIOIA, D. A. et al. (2013): Seeking qualitative rigor in inductive research: Notes on the Gioia Methodology. Organizational Research Methods, 16, 15–31.
GLASER, B. G. and STRAUSS, A. (1967): The discovery of grounded theory: Strategies for qualitative research. Chicago, IL: Aldine.

Keywords

GIOIA METHOD, QUALITATIVE RESEARCH, GROUNDED THE- ORY

Applying Multilevel Path Analysis: Analyzing the Role of Leaders’ Work Engagement for Subordinates’ Work Engagement

Daniela Gutermann1, Nale Lehmann-Willenbrock2, Diana Boer3 and Sven C. Voelpel4

1 Jacobs University Bremen, Campus Ring 1, Bremen, Germany [email protected] 2 VU University Amsterdam, Van der Boechorststraat 1, 1081 BT Amsterdam [email protected] 3 Goethe University Frankfurt, PEG, Grüneburgplatz 1, 60323 Frankfurt am Main, Germany [email protected] 4 Jacobs University Bremen, Campus Ring 1, Bremen, Germany [email protected]

Abstract. Work and organizational psychological data are often hierarchically structured. Employees are part of teams, and various teams comprise departments, which in turn make up organizations. One special attribute of such a nested data structure is that the single observations are not independent. Multilevel modeling is an appropriate technique that accounts for this data structure by analyzing the different levels simultaneously. Furthermore, multilevel path analysis can analyze complex direct and indirect relationships on different levels. We conducted multilevel path analysis in Mplus 6.1 to answer the question of whether a leader’s work engagement influences the work engagement of his or her subordinates. In order to test this, we surveyed 710 employees of a German service organization, working in 104 teams, and their leaders. Results show that leaders’ work engagement is related to positive leader-member exchange, which in turn facilitates subordinates’ work engagement. Moreover, subordinates’ work engagement is positively associated with their performance and a reduced intention to leave.

References

NEZLEK, J.B. (2011): Multilevel modeling for social and personality psychology. SAGE Publications Ltd.

Keywords

MULTILEVEL PATH ANALYSIS, NESTED DATA, WORK ENGAGEMENT

160 Mentoring in Context: An Application of Multilevel Mediation Models in Organizational Research

Doris Rosenauer1a, Annelies E. M. Van Vianen2b, Astrid C. Homan2c, Christiane A. L. Horstmeier1d, and Sven C. Voelpel1e

1 School of Humanities and Social Sciences, Jacobs University Bremen, Campus Ring 1, D-28759 Bremen, Germany [email protected], [email protected], [email protected] 2 Work and Organizational Psychology, University of Amsterdam, Weesperplein 4, 1018 XA Amsterdam, The Netherlands [email protected], [email protected]

Abstract. Only recently have scholars begun to conceptualize leader behaviors in more complex group contexts, such that leadership can represent different concepts at different levels. We apply these new perspectives to mentoring provided by supervisors and distinguish between individual-level differentiated mentoring (i.e., the deviation of an employee's individual perceptions from the average perception within the group) and group-level mentoring (i.e., the average perception across all group members). Testing a latent multilevel mediation model in a sample of 290 vocational trainees and their supervisors, we find that career motivation mediates the positive relationship between differentiated psycho-social mentoring and promotability, whereas job satisfaction mediates the positive effect of differentiated career and psycho-social mentoring on intentions to stay. At the group level, only career mentoring is positively related to promotability and intentions to stay. Multilevel mediation techniques help to disentangle the different mechanisms through which mentoring functions operate within groups: career mentoring seems to operate mainly at the group level, whereas psycho-social mentoring operates mainly at the individual level. We discuss practical and theoretical implications.

Keywords

MULTILEVEL MEDIATION, MENTORING, CAREER DEVELOPMENT

161 Nonhierarchical Asymmetric Cluster Analysis of Relationships Among Managers at a Firm

Akinori Okada1 and Satoru Yokoyama2

1 Graduate School of Management and Information Sciences, Tama University, 4-1-1 Hijirigaoka, Tama-shi, Tokyo 206-0022, Japan [email protected] 2 Department of Business Administration, Faculty of Economics, Teikyo University, 359 Otsuka, Hachioji City, Tokyo 192-0395, Japan [email protected]

Abstract. A nonhierarchical asymmetric cluster analysis with a different goodness-of-fit measure from that of Okada & Yokoyama (2013, submitted) is applied to analyze relationships among managers at a firm (Krackhardt, 1987). It can deal with asymmetric relationships among objects. The nonhierarchical asymmetric cluster analysis classifies objects into clusters, where the number of clusters is determined beforehand. Each cluster consists of a dominant or central object and objects which are dominated by the dominant or central object. The present data represent relationships of giving advice among 21 managers. The twenty-one managers consist of one president, four vice presidents, and supervisors, where each vice president heads up a department, and each supervisor belongs to one of the four departments. The result of the present nonhierarchical asymmetric cluster analysis shows good correspondence with the organization of the firm, and is compared with the result obtained by using the earlier goodness-of-fit measure.

References

KRACKHARDT, D. (1987): Cognitive Social Structures. Social Networks, 9, 109–134.
OKADA, A. and YOKOYAMA, S. (2013): Nonhierarchical asymmetric cluster analysis. Abstract retrieved September 18, 2013, from http://www.cladag2013.it/images/file/CLADAG2013_Abstract.pdf, 353.
OKADA, A. and YOKOYAMA, S. (submitted): Nonhierarchical Asymmetric Cluster Analysis. Submitted to the conference volume of the 9th Scientific Meeting of the Classification and Data Analysis Group of the Italian Statistical Society.

Keywords

ASYMMETRY, CLUSTER ANALYSIS, NONHIERARCHICAL, SOCIAL NETWORK

162 Identification of digital skills profiles using ICT usage data

Dominik Antoni Rozkrut1,2

1 Statistical Office in Szczecin, Poland [email protected] 2 University of Szczecin, Department of Statistics, Poland

Abstract. Skills related to information, media, ICT, data, and network literacy, digital citizenship, and e-health play an important role in a person's ability to perform tasks related to education, work, culture, and life. These competency areas are related to the cognitive abilities necessary for critical thinking. Unfortunately, research indicates that many people are at a disadvantage across these skills, which makes this an important policy issue. A number of competency frameworks have been proposed so far, each focusing on a specific set of skills. The Digital Agenda postulates the DIGCOMP framework, developed by EAC & JRC, for EU-wide indicators of digital competence measuring digital skills. This paper takes another approach to measurement by trying to empirically identify competency profiles. The practical questions are whether it is possible to identify actual empirical competency profiles; if so, what these profiles are; how the actual profiles align with the proposed framework; and, finally, what the competencies of the analyzed population are (regardless of the applied frameworks and their measurement scales). The notion and definition of a digital skills profile is proposed in the paper. The aim of the paper is to propose and test a procedure for the identification of digital skills profiles using data from household surveys on ICT usage. While a number of various clustering methods exist, it is necessary to consider their features with regard to the effective extraction of digital skills profiles.
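The profile-extraction step can be sketched as a clustering of binary skill indicators. This is a hedged illustration in Python with k-means on synthetic data; the paper itself works with the R package clusterSim on household-survey data, and all item probabilities below are invented.

```python
# Illustrative sketch: recover "skills profiles" by clustering binary
# ICT-skill indicators. Cluster centers then read as item-endorsement
# rates per profile.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two invented latent profiles over 6 skill items:
# "basic" users master only the easy items, "advanced" users most items.
basic = rng.binomial(1, [0.9, 0.8, 0.2, 0.1, 0.1, 0.0], size=(200, 6))
advanced = rng.binomial(1, [0.9, 0.9, 0.9, 0.8, 0.7, 0.6], size=(200, 6))
X = np.vstack([basic, advanced])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
profiles = km.cluster_centers_   # per-profile endorsement rate of each item
print(profiles.round(2))
```

In practice the number of profiles would be chosen with validity indices rather than fixed in advance, which is where the features of different clustering methods matter.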

References

GATNAR, E. and WALESIAK, M. (Eds.) (2004): Metody statystycznej analizy wielowymiarowej w badaniach marketingowych [Multivariate statistical analysis methods in marketing research]. Wydawnictwo AE, Wroclaw, 35–38.
WALESIAK, M. and DUDEK, A. (2014): clusterSim: Searching for optimal clustering procedure for a data set. URL http://www.R-project.org/. R package version 0.43-4.

Keywords

DIGITAL SKILLS, ICT USAGE, CLUSTER ANALYSIS, FACTOR ANALYSIS

163 29 CON-4E: Data Analysis in Finance II

Friday, July 4, 2014: 08:30 - 10:35, West Hall 6

Facilitating household financial plan optimization by adjusting time range of analysis to life-length risk aversion

Lukasz Feldman1, Radoslaw Pietrzyk2, and Pawel Rokita3

1 Wroclaw University of Economics [email protected] 2 Wroclaw University of Economics [email protected] 3 Wroclaw University of Economics [email protected]

Abstract. This article tackles the optimization of a financial plan for a two-person household. The two-person model reflects the main features of any household and still remains tractable. As compared with an individual living on her or his own, however, some new specific complications arise. At least three differences are of fundamental significance. The first is the possibility of cost and income sharing, which enables a substantial reduction of the joint amount of pension plan contributions. The second is an additional kind of life-length risk, namely premature-death risk. The third is the second dimension of the survival process. Even if it is assumed that the univariate survival processes of the household members are independent, the interconnections between them originate on the level of consumption. This feature is also responsible for the path-dependence of the consumption process. Due to this, the model cannot be based on conditional survival probability given the state of the household at a moment. The whole trajectory of the survival process (and the consumption process built on it) needs to be taken into consideration. The number of future scenarios (trajectories) grows fast with the length of the remaining period of analysis. This article proposes reducing the number of scenarios to be analyzed by means of an original interpretation of life-length risk aversion measures. In addition, the risk aversion parameters created for the needs of this approach are intuitive and easily applicable.

References

CAMPBELL, J.Y. (2006): Household Finance. Journal of Finance, Vol. LXI, No. 4 (Aug.), 1553–1604.
YAARI, M.E. (1965): Uncertain Lifetime, Life Insurance and Theory of the Consumer. The Review of Economic Studies, Vol. 32(2), 137–150.

Keywords

FINANCIAL PLANNING, HOUSEHOLD FINANCE, LIFE-LENGTH RISK

165 Constructing cumulated net cash flow scenarios with an underlying two-person survival process for household financial planning

Pawel Rokita

Wroclaw University of Economics [email protected]

Abstract. For practical reasons it is sometimes convenient to assume that the consumption of a household is planned at a given level and is deterministic, whereas the feasible consumption is the minimum of planned consumption and liquid financial means. Then, all financial plans sustaining the planned consumption are indistinguishable with respect to consumption itself, but they may generate different amounts of surplus. Under the assumptions of the model that is proposed here, the dynamics of cumulated surplus best reflects the financial situation of the household. The classical expected discounted utility of consumption cannot be used here as a goal function for financial plan optimization. Moreover, the further evolution of the cumulated surplus process from a given moment depends not only on the number of household members that are alive at that moment, but (if one of them has died) also on how long ago that person died and whether it was before or after his or her retirement age. This makes it necessary to analyze the whole trajectory of the process. The article presents a model describing the financial situation of a two-person household in discrete time. Scenarios of the two-dimensional survival process are generated, and it is discussed how to construct trajectories of the household's cumulated surplus on this basis. Then a short discussion of applications of the model in household financial plan optimization is presented.

References

FELDMAN, L., PIETRZYK, R. and ROKITA, P. (2014): A practical method of determining longevity and premature-death risk aversion in households and some proposals of its application. In: M. Spiliopoulou, L. Schmidt-Thieme, R. Janning (Eds.): Data Analysis, Machine Learning and Knowledge Discovery. Berlin-Heidelberg: Springer, 255–264.
YAARI, M.E. (1965): Uncertain Lifetime, Life Insurance and Theory of the Consumer. The Review of Economic Studies, Vol. 32(2), 137–150.

Keywords

HOUSEHOLD FINANCE, INCOMPLETE RETIREMENT, LIFE CYCLE

166 Firm-specific determinants on dividend changes: insights from data mining

Karsten Luebke1 and Joachim Rojahn2

1 FOM Hochschule für Oekonomie und Management, c/o B1st software factory, Rheinlanddamm 201, 44139 Dortmund, Germany [email protected] 2 DIPS Deutsches Institut für Portfolio-Strategien gGmbH, Leimkugelstraße 6, 45141 Essen, Germany [email protected]

Abstract. An accurate prediction of dividend changes is of special interest since such announcements are said to affect stock prices significantly, e.g. due to their information content. We extend previous research in this field by adding the levels as well as the changes of additional firm-specific variables to explain variations in the dividend payout behavior of German firms. In order to identify the important factors for the announcements of dividend changes, we compare data mining techniques like decision trees or random forests with classical methods like the multinomial logit. The comparison is done on Bloomberg Terminal data of the dividend payouts of German Prime Standard issuers during the years 2007-2010. In addition to insights into dividend policy, the prediction performance of the different methods is also analysed.
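The comparison of method families can be sketched as follows. This is a hedged Python illustration on simulated data, not the study's Bloomberg data; the three-class outcome stands in for dividend cut / hold / raise.

```python
# Sketch: compare a multinomial logit baseline with a random forest on a
# simulated 3-class classification task via cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=1)
logit = LogisticRegression(max_iter=1000)          # multinomial logit
forest = RandomForestClassifier(n_estimators=200, random_state=1)

acc_logit = cross_val_score(logit, X, y, cv=5).mean()
acc_forest = cross_val_score(forest, X, y, cv=5).mean()
print(acc_logit, acc_forest)
# A fitted forest additionally exposes variable importances, which is
# how the important firm-specific factors would be identified.
```

Cross-validation mirrors the abstract's interest in out-of-sample prediction performance rather than in-sample fit.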

References

GOERGEN, M., RENNEBOOG, L. and DA SILVA, C. (2005): When do German firms change their dividend? Journal of Corporate Finance, 11, 375–399.
PAYNE, B.C. (2011): On the financial characteristics of firms that initiated new dividends during a period of economic recession and financial market turmoil. Journal of Economics and Finance, 149–163.
WEIHS, C. and LUEBKE, K. (2009): Prediction optimal classification of business phases. In: A. Wagner (Ed.): Empirische Wirtschaftsforschung heute. Schäffer-Poeschel, Stuttgart, 149-156.

Keywords

DIVIDEND POLICY, MULTINOMIAL LOGIT, DECISION TREE, RANDOM FOREST

167 Excess Takeover Premiums and Takeover Contests - An Analysis of Different Approaches for Determining Abnormal Offer Prices

Wolfgang Bessler1 and Colin Schneck2

1 Center for Finance and Banking, Justus-Liebig University Giessen, Licher Strasse 74, 35394 Giessen, [email protected] 2 Center for Finance and Banking, Justus-Liebig University Giessen, Licher Strasse 74, 35394 Giessen, [email protected]

Abstract. In this study we analyze the relationship between offering an excess takeover premium and the occurrence of takeover contests. Our objective is first to compare three different approaches for calculating excess premiums and second to test whether or not excess takeover premiums prevent the occurrence of a takeover contest. We hypothesize that only an excess premium above the industry mean or country mean will deter post-bid competition. Previous studies find mixed evidence for the relationship between the size of takeover premiums and the occurrence of takeover contests. We analyze how excess premiums can be correctly determined and how the calculation affects the observed interaction between takeover premium and post-bid competition. We extend the literature by using three different approaches to measure the excess premium and analyze the impact on the occurrence of takeover contests. First, we calculate the excess premium as the percentage 1) above the pre-offer market value of the target, 2) over the industry mean, and 3) over the country mean. Second, we investigate the effect of these three different calculations of the excess premium on the occurrence of takeover contests. Our results suggest that the calculation method significantly affects the classification of excess premiums in takeover contests. We provide evidence that, when using industry excess premiums, an above-average premium reduces the probability that a takeover contest occurs, especially in cash deals, whereas the standard calculation method is not suitable for correctly discriminating between average and excess premiums.

Keywords

TAKEOVER CONTEST, COMPETITION, EXCESS TAKEOVER PREMIUM, MERGERS AND ACQUISITIONS

168 Interval estimation of Value-at-Risk and Expected Shortfall for ARMA-GARCH models

Krzysztof Piontek

Department of Financial Investments and Risk Management Wroclaw University of Economics, ul. Komandorska 118/120, Wroclaw, Poland [email protected]

Abstract. Model risk is inevitable in many financial issues. The potential threat stems from ignorance of this risk's existence, its magnitude and its consequences. It arises mainly from incorrect assumptions and parameter uncertainty (e.g. short samples). Due to model risk, any quantile-based risk measure value is a random variable and may be estimated using a point or an interval procedure. The interval estimation approach is extremely rare. This research presents issues related to model risk in the process of Value-at-Risk and Expected Shortfall estimation that results from the incorrect parameter estimation of one-dimensional AR-GARCH models. The aim of this study is to illustrate the width of VaR and ES confidence intervals for typical cases. For chosen financial time series models, the author calculated confidence intervals for VaR and ES based on series with different numbers of observations. Intervals were determined on the basis of a two-stage bootstrap procedure taking into account both the errors in estimating the parameters of the AR-GARCH models and the distribution of the standardized quantile residuals, without assumptions about the form of the conditional distribution. Simulated and market data were used. The results of this study are important for backtesting of quantile-based risk measures. It is an extension of the author's previous studies.
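The interval-estimation idea can be sketched with a plain bootstrap on i.i.d. returns. This hedged Python illustration omits the paper's two-stage AR-GARCH step (which would refit the model on resampled standardized residuals); the returns are synthetic.

```python
# Sketch: bootstrap confidence intervals for empirical VaR and ES,
# showing that quantile-based risk measures are themselves random.
import numpy as np

rng = np.random.default_rng(7)
returns = rng.standard_t(df=5, size=500) * 0.01   # synthetic daily returns

def var_es(x, alpha=0.01):
    """Empirical Value-at-Risk and Expected Shortfall at level alpha."""
    q = np.quantile(x, alpha)
    return -q, -x[x <= q].mean()

boot = np.array([var_es(rng.choice(returns, size=returns.size, replace=True))
                 for _ in range(2000)])
ci_var = np.percentile(boot[:, 0], [2.5, 97.5])   # 95% CI for VaR
ci_es = np.percentile(boot[:, 1], [2.5, 97.5])    # 95% CI for ES
print(ci_var, ci_es)
```

With only 500 observations the ES interval is noticeably wider than the VaR interval, which is the kind of width comparison the study reports for typical cases.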

References

CHRISTOFFERSEN, P. and GONÇALVES, S. (2005): Estimation risk in financial risk management. Journal of Risk, Vol. 7, No. 3, pp. 1–28.
HANSEN, B. (2006): Interval forecasts and parameter uncertainty. Journal of Econometrics, 135(1-2), pp. 377–398.
LÖNNBARK, C. (2013): On the role of the estimation error in prediction of expected shortfall. Journal of Banking & Finance, Volume 37, 3, pp. 847–853.

Keywords

VaR, ES, ARMA-GARCH, MODEL RISK, INTERVAL ESTIMATION

30 CON-5A: Machine Learning and Knowledge Discovery VI

Friday, July 4, 2014: 11:00 - 12:40, West Hall 2

Analysing Psychological Data by Evolving Computational Models

Peter C. R. Lane1, Peter D. Sozou2, Fernand Gobet3, and Mark Addis4

1 School of Computer Science, University of Hertfordshire, College Lane, Hatfield AL10 9AB, United Kingdom [email protected] 2 Department of Psychological Sciences, University of Liverpool, Bedford Street South, Liverpool L69 7ZA, United Kingdom [email protected] 3 Department of Psychological Sciences, University of Liverpool, Bedford Street South, Liverpool L69 7ZA, United Kingdom [email protected] 4 Faculty of Performance, Media and English, Birmingham City University, City North Campus, Perry Barr, Birmingham B42 2SU, UK [email protected]

Abstract. We present a system to represent and discover computational models to capture data in psychology. The system uses a Theory Representation Language to define the space of possible models. This space is then searched using Genetic Programming (GP) (Koza, 1992), to discover models which best fit the experimental data. The aim of our semi-automated system is to analyse psychological data and develop explanations of underlying processes. Some of the challenges include: capturing the psychological experiment and data in a way suitable for modelling, controlling the kinds of models that the GP system may develop, and interpreting the final results. We discuss our current approach to all three challenges, and provide results from different examples, including delayed-match-to-sample (Lane, Sozou, Addis & Gobet, 2014) and tasks in visual attention (for example, Kornblum, 1969).

References

KORNBLUM, S. (1969): Sequential Determinants of Information Processing in Serial and Discrete Choice Reaction Time. Psychological Review, 76, 113–131.
KOZA, J.R. (1992): Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press.
LANE, P.C.R., SOZOU, P.D., ADDIS, M. and GOBET, F. (2014): Evolving Process-Based Models from Psychological Data using Genetic Programming. In: Proceedings of AISB.

Keywords

COGNITIVE MODELLING, GENETIC PROGRAMMING, KNOWLEDGE DISCOVERY, PSYCHOLOGY

172 Information Theoretic Measures for Ant Colony Optimization

Gunnar Völkel1, Markus Maucher2, Christoph Müssel2, Uwe Schöning1, and Hans A. Kestler2

1 Institute of Theoretical Computer Science, Ulm University, Germany {gunnar.voelkel,uwe.schoening}@uni-ulm.de 2 Medical Systems Biology, Ulm University, Germany {markus.maucher,christoph.muessel,hans.kestler}@uni-ulm.de

Abstract. The Ant Colony Optimization (ACO) metaheuristic is a family of algorithms to solve combinatorial optimization problems. ACO algorithms construct solutions based on a series of probabilistic decisions. The probability distributions underlying these decisions are modified after each iteration of the algorithm. To analyze and compare different ACO variants, we apply information theoretic measures that capture the internal state of the algorithms. We demonstrate that the mutual information between the measures and the solution quality can aid in making proper design and parameter choices. We propose summarizing the internal state of ACO algorithms by the mean entropy and entropy variance of the random variables. The mean entropy quantifies the exploration capabilities of the algorithm in a specific iteration. We also study measures of the pheromone update function. We investigate these measures with different variants of ACO algorithms applied to the Vehicle Routing Problem with Time Windows. The measures show a clear distinction of different ACO variants on the investigated problem instances.
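The "mean entropy" state measure described above can be sketched directly: average the Shannon entropy of the decision distributions an ACO iteration samples from. This hedged Python illustration uses a synthetic pheromone matrix; all names are illustrative.

```python
# Sketch: mean entropy of row-normalized decision distributions as a
# summary of an ACO algorithm's internal state. High mean entropy means
# exploration; low mean entropy means the search has concentrated.
import numpy as np

def decision_entropies(pheromone):
    """Shannon entropy (in bits) of each row-normalized distribution."""
    p = pheromone / pheromone.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.where(p > 0, p * np.log2(p), 0.0).sum(axis=1)

rng = np.random.default_rng(3)
uniform = np.ones((10, 10))              # early search: no preference yet
converged = rng.random((10, 10)) ** 8    # late search: sharp preferences

h_early = decision_entropies(uniform).mean()
h_late = decision_entropies(converged).mean()
print(h_early, h_late)
```

Tracking such measures over iterations, together with entropy variance, is what allows different ACO variants to be compared on the same footing.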

References

DORIGO, M. and STÜTZLE, T. (2009): Ant Colony Optimization: Overview and recent advances. Brussels, Belgium. IRIDIA, Université Libre de Bruxelles.
VÖLKEL, G., MAUCHER, M. and KESTLER, H. A. (2013): Group-Based Ant Colony Optimization. In: Proceedings of the Fifteenth Annual Conference on Genetic and Evolutionary Computation (GECCO '13). Amsterdam, The Netherlands. ACM, 121–128.

Keywords

INFORMATION THEORETIC MEASURES, ANT COLONY OPTIMIZATION, ENTROPY, COMBINATORIAL OPTIMIZATION

173 Modelling sea surface temperature in the Indian Ocean using gamboostLSS

Miftahuddin1, Marwa Baeshen1, Adi Florea1, Stavros Poupakis1, Benjamin Hofner2, Andreas Mayr2, and Berthold Lausen1

1 Department of Mathematical Sciences, University of Essex, UK [email protected] 2 Department of Biometry and Epidemiology, University of Erlangen-Nuremberg

Abstract. In recent years, we have observed an increase in the global temperature. This increase has caused changes in the climate of the Earth. The climate system is influenced by a large number of variables. Sea surface temperature (SST) is one of the important variables for describing regional and global climate variability. We analyse, for 1231 observations between November 2006 and September 2012, the relationship between SST (in Celsius) obtained at a buoy in the Indian Ocean and air temperature (in Celsius), humidity (in percent) and rainfall (in millimetres) obtained on Sumatra. Following Magnus et al. (2012), we use a linear regression model to describe the underlying basic relationship. To allow for a nonlinear functional relationship and autocorrelation, we fit generalized additive models with a P-spline basis and use boosting for model fitting. We assess the fit of the model using gamboostLSS (Mayr et al., 2012), which allows for a more general model of Location, Scale and Shape (LSS). Using Akaike's information criterion and cross-validation, we observe that the gamboostLSS model fits the data better.

References

MAGNUS, J.R., MELENBERG, B. and MURIS, C. (2012): Global Warming and Local Dimming: The Statistical Evidence. Journal of the American Statistical Association, 106, 452–464.
MAYR, A., FENSKE, N., HOFNER, B., KNEIB, T. and SCHMID, M. (2012): Generalized Additive Models for Location, Scale and Shape for High Dimensional Data - a Flexible Approach Based on Boosting. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61, 403–427.

Keywords

CLIMATE CHANGE, GENERALIZED ADDITIVE MODELS FOR LOCATION, SCALE AND SHAPE

174 Estimating age- and height-dependent percentile curves for children using GAMLSS in the IDEFICS study

Timm Intemann, Hermann Pohlabeln, Diana Herrmann, Wolfgang Ahrens, and Iris Pigeot

Leibniz Institute for Prevention Research and Epidemiology - BIPS GmbH, Achterstr. 30, 28359 Bremen, Germany. [email protected]

Abstract. Age-dependent growth curves are widely used in medical diagnostics for assessing the health status of children. However, until now there have been no reference ranges for a number of clinical parameters in children. To fill this gap, the IDEFICS study provides an excellent database with 18,745 children aged 2.0 – 10.9 years. The generalised additive model for location, scale and shape (GAMLSS) was used to model the influence of various covariates on such clinical parameters. GAMLSS is an extension of the LMS method and is able to model, in particular, the kurtosis using different distributions. Due to the complexity of GAMLSS, different statistical tools, i.e. the Bayesian Information Criterion (BIC), Q-Q plots and worm plots, were applied to assess the goodness of fit of several models. GAMLSS has proven to be a useful tool to model the influence of more than one covariate when deriving age- and sex-specific percentile curves for clinical parameters in children. This will be demonstrated in this talk using bone stiffness as an example, for which percentile curves were calculated for boys and girls based on the model that showed the best goodness of fit, accounting for age and height.

References

STASINOPOULOS, D. and RIGBY, R. A. (2007): Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, 23(7), 1-46.
COLE, T. J., STANOJEVIC, S., STOCKS, J., COATES, A. L., HANKINSON, J. L. and WADE, A. M. (2009): Age- and size-related reference ranges: A case study of spirometry through childhood and adulthood. Statistics in Medicine, 28(5), 880–898.

Keywords

GAMLSS, PERCENTILE CURVES, BONE STIFFNESS

175 31 CON-5B: Data Analysis in Interdisciplinary Domains

Friday, July 4, 2014: 11:00 - 12:40, West Hall 3

A Bayesian approach to test the matching law with observational data

Johannes Zschache

Institute of Sociology, University of Leipzig, [email protected]

Abstract. The matching law is a widely recognised empirical regularity of individual behaviour. It describes a direct relationship between the empirical distribution over a finite set of choices and the empirical distribution over the reinforcements of these choices. There has been a considerable amount of experimental and observational research on the matching law. Most of these studies apply linear models, which is appropriate if a lot of experimental data with different reinforcement schedules is available for each individual. In an observational setting, on the other hand, there is usually only one reinforcement schedule and there are often only a few data points. When using significance tests, the matching law naturally takes the place of the null hypothesis. This means that, with classical statistics, it can only be 'rejected' or 'not rejected' but not 'accepted' by the data. This paper applies different Bayesian methods to test the matching law hypothesis. These methods include individual hypothesis testing by adopting the Bayes factor, and hierarchical models that assume a common Dirichlet distribution or a common Dirichlet process. The paper uses data on penalty kicks that took place during professional (European) football games. The data are characterised by a very low number of cases for each player and are, therefore, well suited to demonstrating these methods.
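How a Bayes factor can favour (not merely fail to reject) the matching law can be sketched for a single player: compare a multinomial with choice probabilities pinned to the reinforcement proportions against an unrestricted alternative with a uniform Dirichlet prior. This is a hedged toy computation in Python; the counts and the flat prior are invented and stand in for the paper's hierarchical models.

```python
# Toy Bayes factor for the matching-law point hypothesis on one player.
from math import lgamma, log, exp

def log_multinomial_fixed(counts, probs):
    """Log-likelihood of counts under a multinomial with fixed probs."""
    n = sum(counts)
    out = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    return out + sum(c * log(p) for c, p in zip(counts, probs))

def log_dirichlet_multinomial(counts, alpha=1.0):
    """Log marginal likelihood under a symmetric Dirichlet(alpha) prior."""
    k, n = len(counts), sum(counts)
    out = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    out += lgamma(k * alpha) - lgamma(n + k * alpha)
    out += sum(lgamma(c + alpha) - lgamma(alpha) for c in counts)
    return out

kicks = [12, 5, 3]               # hypothetical kicks left / centre / right
reinforced = [0.60, 0.25, 0.15]  # hypothetical reinforcement proportions
bf = exp(log_multinomial_fixed(kicks, reinforced)
         - log_dirichlet_multinomial(kicks))
print(bf)
```

Here the choice proportions match the reinforcement proportions exactly, so the Bayes factor favours the matching-law hypothesis, illustrating how Bayesian testing can accumulate evidence *for* a null even with few data points per player.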

References

HERRNSTEIN, R. J. (1997): The Matching Law. Papers in Psychology and Economics. Harvard University Press.
MCAULIFFE, J. D., BLEI, D. M. and JORDAN, M. I. (2006): Nonparametric empirical Bayes for the Dirichlet process mixture model. Statistics and Computing, 16(1), 5–14.

Keywords

BAYESIAN STATISTICS, MATCHING LAW, DIRICHLET PRO- CESS

177 Handling missing data in non-Gaussian hierarchical Bayesian Dynamic Models

Casper J. Albers1

Department of Psychometrics and Statistics, Heymans Institute for Psychological Research, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands. [email protected]

Abstract. Collecting intensive longitudinal data is rapidly gaining popularity in the social and behavioural sciences. Dynamic Bayesian models offer a versatile way of modelling such data. This class of models can be extended to work with hierarchical structures in the data and the parameters, as well as with data for which the usual normality assumptions do not hold. In this presentation, I shall introduce a model for multilevel or hierarchical time series data that are Poisson or Binomial distributed and in which missing values occur. To this end, I combine Poisson and Binomial dynamic linear models with a new approach to handling missing data. The missing data approach resembles multiple imputation, using all available data and properly estimating the uncertainty due to the missingness and the amount of data available. The model is estimated using a stepwise MCMC approach. For a two-level model, first the model at Level 2 ('between-subjects') is estimated. Then, the model at Level 1 ('within-subjects') is estimated, conditional on the Level 2 model. Predictors at Levels 1 and 2 can be included, though at Level 2 they are restricted to nominal ones. I will illustrate the method with two data sets. The first is from clinical psychiatry, on the differential effects of three treatments for panic disorder during the first year of treatment. The severity was expressed via the number of panic attacks experienced every week, thus requiring a Poisson model. Model selection was performed on the basis of information criteria. Subsequently, post-hoc analyses were carried out. The second data set is from social psychology, on the change of opinions towards natural gas extraction in the Dutch province of Groningen. Opinions are classified daily into one of three categories, thus requiring a multinomial model. Three classes of missing data are specified, and for each a solution is built into the model.

References

KRONE, T., ALBERS, C.J. and TIMMERMAN, M. (2014): Missing Data in a Multilevel Poisson Dynamic Linear Model, submitted for publication.

Keywords

TIME SERIES ANALYSIS, LONGITUDINAL DATA, MISSING DATA, HIERARCHICAL MODELS

178 Bayesian analysis for mixtures of discrete distributions with a non-parametric component

Baba B. Alhaji, Hongsheng Dai, Yoshiko Hayashi and Berthold Lausen

Department of Mathematical Sciences, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, UK; [email protected], [email protected]

Abstract. The analysis of Bayesian finite mixtures of distributions has received increasing attention over the last three decades, due to their usefulness as a flexible method of Bayesian parametric modelling for classification and density fitting. In certain application areas where large discrete data sets are involved, interest lies in distinguishing 'signal' and 'noise' components, for example in ChIP-seq data (Bao et al., 2013). Except for the noise, it is often difficult to justify a single distribution for the signal, because the distribution of the signal has a long tail. Therefore the signal distribution is usually further modelled via a mixture of component distributions (Kuan et al., 2011). Modelling the signal as a mixture distribution is computationally challenging, as a result of justifying the number of components and the label switching problem (due to the exchangeability of the likelihood) (Stephens, 2000). To solve this problem, this paper uses a non-parametric distribution to model the 'signal'. This new methodology for Bayesian mixtures of distributions of discrete random variables is more efficient than other existing methods. The label switching problem does not occur in our study when the sample size is large.

References

BAO, Y., VINCIOTTI, V., WIT, E. and HOEN, P.A.C. (2013): Accounting for immunoprecipitation efficiencies in the statistical analysis of ChIP-Seq data. BMC Bioinformatics, 14, 169.
KUAN, P.F., CHUNG, D., PAN, G., THOMSON, J.A., STEWART, R. and KELEŞ, S. (2011): A statistical framework for the analysis of ChIP-Seq data. Journal of the American Statistical Association, 106, 891–903.
STEPHENS, M. (2000): Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62, 795–809.

Keywords

BAYESIAN STATISTICS, GIBBS SAMPLER, MCMC, MIXTURE OF DISTRIBUTIONS, CHIP-SEQ

179 Exploring Unknown Terrain

Irmela Herzog

The Rhineland Commission for Archaeological Monuments and Sites The Rhineland Regional Council [email protected]

Abstract. Recently, reconstructing ancient paths has become a standard tool in archaeological spatial analysis for non-uniform terrain. In most applications, the route reconstruction is based on Dijkstra's algorithm, which generates the globally optimal path between two locations. Fábrega Álvarez and Parcero Oubiña (2007) proposed an algorithm that identifies optimal radial paths for a given location. These globally optimal paths require full knowledge of the surroundings, but a newcomer will proceed differently. A new agent-based model is presented that identifies probable paths for exploring unknown terrain by a newcomer who has only limited knowledge of the landscape ahead. This contribution will compare the outcomes of newcomer models with those of full-knowledge models for a study area with several known ancient routes in the Bergisches Land, Germany.
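The full-knowledge baseline the abstract refers to can be sketched as Dijkstra's algorithm over a grid of movement costs. This is a hedged, minimal Python illustration; the tiny cost grid is invented and stands in for a real cost surface.

```python
# Minimal Dijkstra least-cost search on a 4-connected cost grid: the
# building block of "full knowledge" route reconstruction.
import heapq

def dijkstra(cost, start, goal):
    """Least accumulated cost of entering cells from start to goal."""
    rows, cols = len(cost), len(cost[0])
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return d
        if d > dist[(r, c)]:
            continue  # stale queue entry
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return float("inf")

terrain = [[1, 1, 1],
           [9, 9, 1],   # a "steep" row the optimal path detours around
           [1, 1, 1]]
print(dijkstra(terrain, (0, 0), (2, 0)))
```

An agent-based newcomer model differs precisely in that the agent cannot consult the whole cost grid; it decides locally from the limited landscape it can see.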

References

DIJKSTRA, E.W. (1959): A note on two problems in connexion with graphs. Numerische Mathematik, 1, 269–271.
FÁBREGA ÁLVAREZ, P. and PARCERO OUBIÑA, C. (2007): Proposals for an archaeological analysis of pathways and movement. Archeologia e Calcolatori, 18, 121–140.
KOHLER, T.A. and GUMERMAN, G.J. (2000): Dynamics in Human and Primate Societies: Agent-Based Modeling of Social and Spatial Processes. Oxford University Press, New York.

Keywords

DIJKSTRA’S ALGORITHM, AGENT-BASED MODEL

32 CON-5C: Data Analysis in Social Sciences IV

Friday, July 4, 2014: 11:00 - 12:40, West Hall 4

Social Value Orientation and Expectations of Cooperation: A Meta-Analysis

Jan Luca Pletzer1, Daniel Balliet2, and Sven C. Voelpel3

1 Jacobs University Bremen, School of Humanities and Social Sciences, Campus Ring 1, 28759 Bremen, Germany [email protected] 2 VU University Amsterdam, Faculty of Psychology and Education, Social and Organizational Psychology, Van der Boechorststraat 1, 1081 BT Amsterdam, The Netherlands [email protected] 3 Jacobs University Bremen, School of Humanities and Social Sciences, Campus Ring 1, 28759 Bremen, Germany [email protected]

Abstract. Social value orientation describes the weights individuals attach to their own and others’ outcomes and, based on different measures, individuals are usually classified as cooperators, individualists, or competitors. In social dilemmas, expectations of others’ cooperation are assumed to be a major determinant of own cooperation (e.g., Van Lange, 1992), but evidence is inconclusive as to whether these expectations differ between cooperators, individualists, and competitors. Specifically, three different theoretical accounts have been developed to predict the amount of cooperation individuals with different social value orientations expect from others: the triangle hypothesis, the structural assumed similarity bias, and the cone model. Preliminary evidence suggests that the cone model is the most accurate theoretical account (Aksoy and Weesie, 2012), but a meta-analytical assessment of expectations of cooperation among individuals with differing social value orientations will yield a better test of these accounts and provide more conclusive evidence. This presentation will focus on data analysis procedures with the Comprehensive Meta-Analysis software.
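A meta-analytic pooling of correlations of the kind the presentation discusses can be sketched as follows. This is a generic random-effects computation via Fisher's z transform and the DerSimonian-Laird heterogeneity estimator, not the Comprehensive Meta-Analysis software itself; the study correlations and sample sizes are invented.

```python
import math

def meta_correlations(rs, ns):
    """Random-effects (DerSimonian-Laird) pooling of correlations
    via Fisher's z transform. Returns (pooled r, tau^2)."""
    zs = [0.5 * math.log((1 + r) / (1 - r)) for r in rs]  # Fisher z
    vs = [1.0 / (n - 3) for n in ns]                      # var of z
    ws = [1.0 / v for v in vs]
    z_fixed = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    # heterogeneity: Q statistic and DL estimate of tau^2
    q = sum(w * (z - z_fixed) ** 2 for w, z in zip(ws, zs))
    c = sum(ws) - sum(w * w for w in ws) / sum(ws)
    tau2 = max(0.0, (q - (len(rs) - 1)) / c)
    # re-weight with between-study variance added
    ws_re = [1.0 / (v + tau2) for v in vs]
    z_re = sum(w * z for w, z in zip(ws_re, zs)) / sum(ws_re)
    return math.tanh(z_re), tau2      # back-transform to r

# Hypothetical study correlations between SVO and expected cooperation
r_pool, tau2 = meta_correlations([0.30, 0.25, 0.45, 0.10],
                                 [120, 80, 60, 200])
```

In a full analysis, such pooling would be run separately for cooperators, individualists, and competitors, and the pooled estimates compared across groups.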

References

AKSOY, O. and WEESIE, J. (2012): Beliefs about the Social Orientations of Others: A Parametric Test of the Triangle, False Consensus, and Cone Hypothesis. Journal of Experimental Social Psychology, 48, 45–54.
VAN LANGE, P.A.M. (1992): Confidence in Expectations: A Test of the Triangle Hypothesis. European Journal of Personality, 6, 371–379.

Keywords

SOCIAL VALUE ORIENTATION, COOPERATION, EXPECTATION, META-ANALYSIS

Using Meta-Analytic Structural Equation Modeling (MASEM) to Test New Models in Organizational Research: The Example of Transformational Leadership Effects on Identification at Work

Christiane A. L. Horstmeier1a, Diana Boer2, Astrid C. Homan3, and Sven C. Voelpel1b

1 School of Humanities and Social Sciences, Jacobs University Bremen, Campus Ring 1, D-28759 Bremen, Germany [email protected],[email protected] 2 Department of Social Psychology, Goethe University Frankfurt, Grueneburgplatz 1, 60323 Frankfurt am Main, Germany [email protected] 3 Work and Organizational Psychology, University of Amsterdam, Weesperplein 4, 1018 XA Amsterdam, The Netherlands [email protected]

Abstract. Meta-analysis (MA) enables researchers to quantitatively summarize previously reported data on well-studied relationships between variables, while structural equation modeling (SEM) is used to adequately capture the complex interplay between several variables in a comprehensive model. Combining both approaches, meta-analytic structural equation modeling (MASEM) has been suggested for testing new research models in organizational behavior on the basis of meta-analytic correlations among the variables. We apply this technique to shed light on the relationships between transformational leadership (TFL) and identifications with various foci (i.e., the leader, the team, and the organization). In a dataset of 53 studies (N = 15,491), we find that TFL is more strongly related to leader identification than to organizational or team identification. Furthermore, we propose a model in which leader identification mediates TFL’s effects on identification with the team and the organization. Our findings exemplify how MASEM can be applied in organizational research, as it contributes to a more comprehensive understanding of the multiple effects of TFL on identifications and their complex interplay.
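The two-stage MASEM logic (pool correlations meta-analytically, then fit a path model to the pooled matrix) can be sketched in miniature. The correlation values below are invented, not the paper's pooled estimates; the standardized path coefficients follow from solving the normal equations on the pooled correlation matrix.

```python
import numpy as np

# Hypothetical pooled meta-analytic correlations among transformational
# leadership (TFL), leader identification (LID) and team identification (TID)
R = np.array([
    [1.00, 0.60, 0.30],   # TFL
    [0.60, 1.00, 0.45],   # LID
    [0.30, 0.45, 1.00],   # TID
])

# Mediation model TFL -> LID -> TID with a direct TFL -> TID path.
b_tfl_lid = R[0, 1]                       # TFL -> LID (simple regression)
# TID regressed on TFL and LID: solve R_xx b = r_xy for standardized betas
b = np.linalg.solve(R[:2, :2], R[:2, 2])
direct_tfl, b_lid_tid = b
indirect = b_tfl_lid * b_lid_tid          # mediated effect TFL -> LID -> TID
```

A useful consistency check is that the direct and indirect effects sum to the zero-order TFL-TID correlation; here most of that correlation is carried through leader identification, mirroring the mediation hypothesis in the abstract.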

Keywords

META-ANALYTIC STRUCTURAL EQUATION MODELING, TRANSFORMATIONAL LEADERSHIP, IDENTIFICATION

Biasing Effects of Non-representative Samples of Quasi-orders in the Assessment of Recovery Quality of IITA-type Item Hierarchy Mining

Ali Ünlü1 and Martin Schrepp2

1 Technische Universität München, Munich, Germany [email protected] 2 SAP AG, Walldorf, Germany [email protected]

Abstract. Item hierarchies modeled as quasi-orders postulate mastery implications among educational test items and play an important role in the psychometric theory of learning spaces. Assessment instruments based on implications between items may be used to design adaptive knowledge assessment and training procedures. We study inductive item tree analysis (IITA) for mining item hierarchies or quasi-orders in educational assessment data. Given dichotomously scored response data as their inputs, IITA analyses produce a set of plausible implications between the items. We answer the question: “How can simulation studies be designed to reliably evaluate and compare the quality of IITA-type item hierarchy mining techniques for reconstructing the true implications from observed data?” It is shown that this question essentially reduces to the problem of realizing samples of quasi-orders that are representative of the population of all possible quasi-orders. Based on the findings we report for two simulation studies, biases and wrong conclusions induced by non-representative samples of quasi-orders at the basis of the simulation are exemplified with three IITA algorithms. One study uses absolute normal sampling for quasi-order generation, which is biased; the other applies simple random sampling, which yields representative random quasi-orders.

References

FALMAGNE, J.-CL. and DOIGNON, J.-P. (2011): Learning Spaces. Springer, Berlin.
SARGIN, A. and ÜNLÜ, A. (2009): Inductive Item Tree Analysis: Corrections, Improvements, and Comparisons. Mathematical Social Sciences, 58, 376–392.
SCHREPP, M. (1999): On the Empirical Construction of Implications between Bi-valued Test Items. Mathematical Social Sciences, 38, 361–375.

Keywords

MINING ITEM HIERARCHIES, INDUCTIVE ITEM TREE ANALYSIS, EDUCATIONAL ASSESSMENT, REPRESENTATIVE QUASI-ORDERS

Let’s Get Dynamic: Interaction Analysis

Fabiola H. Gerpott1 and Nale Lehmann-Willenbrock2

1 Jacobs University Bremen, VU Amsterdam [email protected] 2 VU Amsterdam [email protected]

Abstract. Several scholars have recently called for more dynamic approaches for understanding teamwork processes (e.g., Cronin, Weingart and Todorova, 2011). The input-process-output model (Hackman and Morris, 1975) suggests that internal and external factors influence team performance via social interaction processes. We introduce interaction analysis as a methodological tool for gaining insights into the micro-processes that characterize team interactions. Specifically, we illustrate how the act4teams coding scheme can highlight the behavioral processes that differentiate successful teams from less successful teams (Kauffeld and Lehmann-Willenbrock, 2012). To use act4teams, team meetings are videotaped and verbal behaviors are coded unit-by-unit into mutually exclusive categories (problem-focused, procedural, socioemotional, and action-oriented behaviors). The coded data can be analyzed via sequential analysis to understand how specific behaviors trigger specific other reactions within the interaction process. In addition, the behavioral data can be pooled into overall frequencies that are then linked to individual and team attitudes and performance outcomes. Besides presenting the methodological advantages of interaction analysis, we discuss limitations and avenues for future research.
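The sequential-analysis step can be sketched as a lag-1 transition table over coded behavior units. The category names echo act4teams, but the coded excerpt is invented for the example.

```python
from collections import Counter

def lag1_transitions(sequence):
    """Observed lag-1 transition proportions between coded behavior
    categories: for each pair (a, b), the share of units following an
    'a' that were coded 'b'."""
    pair_counts = Counter(zip(sequence, sequence[1:]))
    row_totals = Counter(sequence[:-1])
    return {(a, b): n / row_totals[a] for (a, b), n in pair_counts.items()}

# A hypothetical coded meeting excerpt (one label per speech unit)
meeting = ["problem", "problem", "action", "socioemotional",
           "problem", "action", "action", "procedural", "problem"]
probs = lag1_transitions(meeting)
```

In a real analysis the observed proportions would be tested against chance expectations to identify behavior pairs (e.g. problem statements triggering action planning) that occur more often than expected.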

References

CRONIN, M.A., WEINGART, L.R. and TODOROVA, G. (2011): Dynamics in Groups: Are We There Yet? The Academy of Management Annals, 5, 571–612.
HACKMAN, J.R. and MORRIS, C.G. (1975): Group Tasks, Group Interaction Process, and Group Performance Effectiveness: A Review and Proposed Integration. Advances in Experimental Social Psychology, 8, 45–99.
KAUFFELD, S. and LEHMANN-WILLENBROCK, N. (2012): Meetings Matter: Effects of Team Meetings on Team and Organizational Success. Small Group Research, 43, 130–158.

Keywords

INTERACTION ANALYSIS, TEAM EFFECTIVENESS, PROCESSES

33 CON-5D: Data Analysis in Interdisciplinary Domains (Musicology II)

Friday, July 4, 2014: 11:00 - 12:40, West Hall 5

Duplicate detection in facsimile scans of early printed music

Christophe Rhodes, Tim Crawford, and Mark d’Inverno

Department of Computing, Goldsmiths, University of London {c.rhodes,t.crawford,dinverno}@gold.ac.uk

Abstract. There is a growing number of collections of readily available scanned musical documents, whether generated and managed by libraries, research projects or volunteer efforts. They are typically digital images; for computational musicology we also need the musical data in machine-readable form. Optical Music Recognition (OMR) can be used on printed music, but is prone to error, depending on document condition and the quality of intermediate stages in the digitization process such as archival photographs. In performing OMR on the British Library’s Early Music Online collection (Pugin and Crawford, 2013) of 16th-century volumes we must deal with the problem of images which are rescans of the same pages. These images are not precise digital duplicates of each other, and so must be detected through some approximate means. As well as duplicate scans, there are other forms of similarity present in the collection, such as musical relatedness and movable-type reuse. We present our work on developing and combining image-based near-duplicate detection, based on SIFT features (Lowe, 1999), with OMR-based musical content near-duplicate detection. We evaluate an order-statistic based method for finding duplicate scans of pages, and additionally identify a number of distinct kinds of approximate similarity from our distance measures: substantial reuse of graphical material; musical quotation; and title page detection.
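The order-statistic idea can be sketched without any image processing by standing in random vectors for real SIFT descriptors. The descriptor dimensionality, noise level and choice of the median as the order statistic are all assumptions of this toy example, not the authors' settings.

```python
import math
import random

random.seed(42)

def best_match_distances(desc_a, desc_b):
    """For each descriptor of page A, the Euclidean distance to its
    nearest neighbour among page B's descriptors."""
    return [min(math.dist(a, b) for b in desc_b) for a in desc_a]

def order_stat_distance(desc_a, desc_b, q=0.5):
    """Robust page distance: the q-th order statistic (default: median)
    of the best-match distances, insensitive to a few bad matches."""
    d = sorted(best_match_distances(desc_a, desc_b))
    return d[int(q * (len(d) - 1))]

# Toy "descriptors": 8-d random vectors; page B is a noisy rescan of
# page A, page C is an unrelated page
page_a = [[random.random() for _ in range(8)] for _ in range(50)]
page_b = [[x + random.gauss(0, 0.01) for x in v] for v in page_a]
page_c = [[random.random() for _ in range(8)] for _ in range(50)]

d_dup = order_stat_distance(page_a, page_b)
d_diff = order_stat_distance(page_a, page_c)
```

A rescan yields a small order-statistic distance (every descriptor finds its slightly perturbed twin), while an unrelated page does not, so a simple threshold separates the two cases.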

References

PUGIN, L. and CRAWFORD, T. (2013): Evaluating OMR on the Early Music Online Collection. In: Proc. International Society for Music Information Retrieval Conference. Curitiba, Brazil, 439–444.
LOWE, D.G. (1999): Object recognition from local scale-invariant features. In: Proc. Int. Conf. on Computer Vision. Corfu, Greece, 1150–1157.

Keywords

MUSIC, OPTICAL MUSIC RECOGNITION, CLUSTERING, SIMILARITY MEASURES

Combining audio features and playlist statistics for improved music category recognition

Igor Vatolkin, Geoffray Bonnin, and Dietmar Jannach

TU Dortmund, Faculty of Computer Science {igor.vatolkin;geoffray.bonnin;dietmar.jannach}@tu-dortmund.de

Abstract. In recent years, a number of approaches have been developed for the automatic recognition of music genres, but also of more specific categories (styles, moods, personal preferences, etc.), enabling better computerized support for the organization and management of digital music collections. Among the different sources for training classification models, features extracted from the audio signal play an important role in the literature. Although such features can be extracted from any digitized music piece independently of the availability of other information sources, their extraction can incur considerable computational cost, and the audio alone does not always contain enough information for the identification of distinctive properties of the related category. Another interesting source for feature extraction is playlists created and shared by music listeners. The tracks of a playlist are often not too different in terms of their genre, which makes it possible to exploit their co-occurrences to predict genres or other categories. In this paper we propose different approaches to combining audio features with track co-occurrences in playlists, with the aim of enhancing the quality of genre and style classification.
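One simple way to combine the two sources, not necessarily one of the authors' approaches, is late fusion of audio-based genre probabilities with playlist co-occurrence counts. The tracks, playlists and probabilities below are invented.

```python
from collections import Counter

# Hypothetical playlists and known genre labels for some tracks
playlists = [["t1", "t2", "t3"], ["t2", "t3", "t4"],
             ["t5", "t6"], ["t1", "t4"]]
genre = {"t1": "rock", "t2": "rock", "t3": "rock",
         "t5": "jazz", "t6": "jazz"}

def cooccurrence_scores(track):
    """Genre evidence from playlists: count labelled tracks that
    co-occur with `track`, normalised to a distribution."""
    votes = Counter()
    for pl in playlists:
        if track in pl:
            for other in pl:
                if other != track and other in genre:
                    votes[genre[other]] += 1
    total = sum(votes.values()) or 1
    return {g: c / total for g, c in votes.items()}

def fuse(audio_probs, track, alpha=0.5):
    """Late fusion: convex combination of audio-based and
    playlist-based genre probabilities."""
    pl_probs = cooccurrence_scores(track)
    genres = set(audio_probs) | set(pl_probs)
    return {g: alpha * audio_probs.get(g, 0.0)
              + (1 - alpha) * pl_probs.get(g, 0.0) for g in genres}

# t4 is unlabelled; its audio model is uncertain, but playlists help
fused = fuse({"rock": 0.5, "jazz": 0.5}, "t4")
```

Here the audio model is uninformative for t4, but its playlist neighbours are all rock, so the fused score tips the decision, which is exactly the complementarity the abstract argues for.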

References

KNEES, P., POHLE, T., SCHEDL, M. and WIDMER, G. (2006): Combining Audio-based Similarity with Web-based Data to Accelerate Automatic Music Playlist Generation. In: Proc. of the 8th ACM SIGMM International Workshop on Multimedia Information Retrieval, 147–154.
STURM, B. (2012): A Survey of Evaluation in Music Genre Recognition. In: Proc. of the 10th International Workshop on Adaptive Multimedia Retrieval (AMR).
VATOLKIN, I. (2013): Improving Supervised Music Classification by Means of Multi-Objective Evolutionary Feature Selection. PhD thesis, Department of Computer Science, TU Dortmund.

Keywords

MUSIC CLASSIFICATION, AUDIO FEATURES, PLAYLIST STATISTICS

Digital Music Lab - A Framework for Analysing Big Music Data

Tillman Weyde1, Stephen Cottrell1, Emmanouil Benetos1, Daniel Wolff1, Dan Tidhar1, Jason Dykes1, Mark Plumbley2, Simon Dixon2, Mathieu Barthet2, Nicolas Gold3, Samer Abdallah3, and Mahendra Mahey4

1 City University London {t.e.weyde,stephen.cottrell.1, emmanouil.benetos.1,daniel.wolff.1,dan.tidhar.1,j.dykes}@city.ac.uk 2 Queen Mary University of London {mark.plumbley,s.e.dixon,m.barthet}@qmul.ac.uk 3 University College London {n.gold, s.abdallah}@ucl.ac.uk 4 The British Library [email protected]

Abstract. The Digital Music Lab is a new research project, funded by the AHRC UK, in which research methods and a software framework for analysing Big Music Data are developed. We develop the musicological methodology for large and heterogeneous data with state-of-the-art algorithms for automated music analysis and tools for collection-level analysis and visualisations. The software services, tools and derived data will be made available for reuse and collaboration. Music research has developed as “data oriented empirical research” [1]. However, this research has so far been limited to relatively small datasets. On the other hand, researchers in Music Information Retrieval (MIR) are exploring large datasets (e.g. the Million Song Dataset [2]). The Digital Music Lab aims to bridge the gap to MIR by supporting the application of statistical analysis and machine learning in musicology. We are preparing two prototype installations of the Digital Music Lab infrastructure, at the British Library and I Like Music, which will provide facilities to analyse a collection of over one million pieces of music. We will provide tools and visualisations that combine state-of-the-art music analysis with intelligent collection-level analysis, enabling data-driven music research in ways that were not previously possible.

References

[1] PARNCUTT, R. (2007): Systematic musicology and the history and future of Western musical scholarship. Journal of Interdisciplinary Music Studies, 1, 1–32.
[2] BERTIN-MAHIEUX, T., ELLIS, D., WHITMAN, B. and LAMERE, P. (2011): The Million Song Dataset. In: Proceedings of ISMIR 2011, 591–596.

Keywords

Music Analysis, Computational Musicology, Software Infrastructure

Machine Learning for the Analysis of a Large Collection of Musical Scales

Srikanth Cherla, Dan Tidhar, Artur d’Avila Garcez, and Tillman Weyde

Music Informatics Research Group, City University London {srikanth.cherla.1, dan.tidhar.1, a.garcez, t.e.weyde}@city.ac.uk

Abstract. A scale, in music, is a collection of musical notes which can be played together either in sequence or in harmony. Scales are fundamental to the melodic and harmonic structure of music, and listeners often associate a scale with a particular mood, a region of the world, or a musical era. In this paper, we present a machine learning approach towards the analysis of a large collection of scales (around 3,500) in the Scala musical scale dataset [1]. We start with a binary representation of a scale obtained by taking into account the distance of its component notes from the tonic. We then train three unsupervised learning models — principal components analysis [2], a restricted Boltzmann machine, and a deep belief network [3] — to obtain low-dimensional, dense, real-valued vector representations for the scales. On visualizing these scale representations in 2D using the t-SNE algorithm [4], we observed that the learned representations effectively capture scale-similarity information — labels corresponding to similar scales are clustered close together in the map. We visually compare the resultant scale maps generated using the three models. Furthermore, we carry out a musicological analysis of the maps and attempt to provide further insights into the meaning of the learned representations.
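The first of the three models, PCA of binary scale vectors, can be sketched directly. A handful of well-known 12-tone scales stand in for the roughly 3,500 Scala scales; the binary encoding marks which semitone offsets from the tonic are present, which is one plausible reading of the representation the abstract describes.

```python
import numpy as np

# Scales as 12-d 0/1 vectors: index i == 1 iff the pitch i semitones
# above the tonic belongs to the scale
scales = {
    "major":          [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1],
    "natural_minor":  [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    "harmonic_minor": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1],
    "major_pent":     [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0],
    "minor_pent":     [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0],
}
X = np.array(list(scales.values()), dtype=float)

# PCA via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T        # 2-d embedding of each scale
```

Even in the raw binary space the two minor scales (two differing bits) are closer than the two pentatonic scales (six differing bits), and a faithful low-dimensional map should preserve such neighbourhood structure.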

References

[1] http://www.huygens-fokker.org/scala/
[2] DUNTEMAN, G.H. (1989): Principal Components Analysis (No. 69). Sage.
[3] HINTON, G.E. and SALAKHUTDINOV, R.R. (2006): Reducing the dimensionality of data with neural networks. Science, 313 (5786), 504–507.
[4] VAN DER MAATEN, L. and HINTON, G.E. (2008): Visualizing Data using t-SNE. Journal of Machine Learning Research, 9 (11), 2579–2605.

Keywords

Musical Scales, Restricted Boltzmann Machines, Deep Belief Networks, Principal Components Analysis, t-SNE

34 CON-5E: Clustering II

Friday, July 4, 2014: 11:00 - 12:40, West Hall 6

Supervised pretreatments are useful for supervised clustering

Vincent Lemaire1, Oumaima Alaoui Ismaili1,2, and Antoine Cornuéjols2

1 Orange Labs, 2 av. Pierre Marzin, 22300 Lannion 2 AgroParisTech, 16 rue Claude Bernard, 75005 Paris

Abstract. This paper centers on supervised clustering. Unlike traditional clustering, supervised clustering is applied to classified examples. Supervised clustering is “a clustering technique where instances in the resulting groups (or clusters) are the most similar and belong simultaneously to the same target class” [1]. This kind of method is used, for example, to simultaneously create groups of instances (X) and predictions for the target variable (C). Each group can then be described using the center (profile) of the group. The reader can find a detailed description of this type of application in [3]. The paper aims to study the influence of pretreatments, applied before clustering, on the accuracy obtained with an unsupervised k-means [3] when the information given by a target variable is available. To illustrate the impact on performance, an empirical evaluation is presented on a large panel of real data sets taken from the UCI repository (30 data sets) comprising both categorical and numerical data. The results using the best pretreatments are then compared to a supervised k-means [2]. We show that supervised pretreatments which estimate the univariate conditional density (P(X|C)) are very useful when the purpose is to create a supervised clustering.
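A minimal sketch of such a supervised pretreatment, recoding each categorical value by its estimated conditional densities P(v|c) before running a plain k-means, might look as follows. The toy data and the deterministic initialization are assumptions of the example, not the paper's protocol.

```python
from collections import Counter, defaultdict

def conditional_recoding(values, labels, classes):
    """Supervised pretreatment: replace a categorical value v by the
    vector of estimated conditional densities (P(v | c)) over classes."""
    per_class = {c: Counter(v for v, y in zip(values, labels) if y == c)
                 for c in classes}
    totals = {c: sum(cnt.values()) for c, cnt in per_class.items()}
    return [[per_class[c][v] / totals[c] for c in classes] for v in values]

def kmeans(points, k, iters=20):
    # deterministic initialization for the example: first and last points
    centers = [list(points[0]), list(points[-1])]
    for _ in range(iters):
        groups = defaultdict(list)
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centers[j])))
            groups[j].append(p)
        for j, pts in groups.items():
            centers[j] = [sum(col) / len(pts) for col in zip(*pts)]
    return centers

# Toy categorical feature: "a" mostly class 0, "b" mostly class 1, "c" mixed
values = ["a"] * 40 + ["b"] * 40 + ["c"] * 20
labels = [0] * 35 + [1] * 5 + [1] * 35 + [0] * 5 + [0] * 10 + [1] * 10
recoded = conditional_recoding(values, labels, classes=(0, 1))
centers = kmeans(recoded, k=2)
```

After recoding, class-typical values land far apart in the (P(v|0), P(v|1)) plane, so even an unsupervised k-means recovers class-aligned clusters.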

References

[1] EICK, C.F., ZEIDAT, N. and ZHAO, Z. (2004): Supervised Clustering – Algorithms and Benefits. In: Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004), Boca Raton, 774–776.
[2] AL-HARBI, S.H. and RAYWARD-SMITH, V.J. (2006): Adapting K-means for Supervised Clustering. Applied Intelligence, 24(3), 219–226.
[3] LEMAIRE, V., CLÉROT, F. and CREFF, N. (2014): K-means clustering on a classifier-induced representation space: application to customer contact personalization. Annals of Information Systems, Special Issue on Real-World Data Mining Applications.

Keywords

Supervised Clustering, K-means, Preprocessing

Bottom-up Variable Selection in Cluster Analysis Using Bootstrapping: A Proposal

Hans-Joachim Mucha1 and Hans-Georg Bartel2

1 Weierstrass Institute for Applied Analysis and Stochastics (WIAS), 10117 Berlin, Mohrenstraße 39, Germany, [email protected] 2 Department of Chemistry at Humboldt University, Berlin, Brook-Taylor-Straße 2, 12489 Berlin, Germany, [email protected]

Abstract. Variable selection is a well-known problem in many areas of multivariate statistics such as classification, clustering and regression. The hope is that the structure of interest may be contained in only a small subset of variables. In contrast to supervised classification such as discriminant analysis, variable selection in cluster analysis is a much more difficult problem because usually nothing is known about the true class structure. In addition, in clustering, variable selection is highly related to the main problem of determining the number of clusters K inherent in the data. There are many papers on variable selection in clustering, mainly based on special cluster separation measures such as the Davies and Bouldin (1979) criterion: the ratio of within-cluster dispersion and between-cluster separation. Here we present a general bottom-up approach to variable selection using non-parametric bootstrapping based on criteria of stability such as the adjusted Rand index (Hubert and Arabie, 1985). Being general means it makes use only of measures of stability of partitions, so it can be applied to almost any cluster analysis method. We choose bootstrapping as the favorite resampling technique because of its good performance in finding the number of clusters K (Mucha and Bartel, 2014).
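The stability criterion can be made concrete with a small implementation of the adjusted Rand index (Hubert and Arabie, 1985). The shuffling loop at the end mimics, very loosely, how partitions of unstructured data score near zero, whereas a relabelled copy of the same partition scores exactly one.

```python
import random
from collections import Counter
from math import comb

def adjusted_rand_index(part_a, part_b):
    """Adjusted Rand index (Hubert and Arabie, 1985) between two
    partitions given as label lists of equal length."""
    n = len(part_a)
    contingency = Counter(zip(part_a, part_b))
    a_sizes = Counter(part_a)
    b_sizes = Counter(part_b)
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in a_sizes.values())
    sum_b = sum(comb(c, 2) for c in b_sizes.values())
    expected = sum_a * sum_b / comb(n, 2)   # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Relabelling does not matter: a renamed copy scores 1.0
identical = adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])

# Against random relabellings the index averages about zero
random.seed(7)
base = [i // 10 for i in range(30)]         # three clusters of 10
shuffled = base[:]
aris = []
for _ in range(200):
    random.shuffle(shuffled)
    aris.append(adjusted_rand_index(base, shuffled))
mean_ari = sum(aris) / len(aris)
```

In the proposed bottom-up procedure, such index values would instead be computed between partitions of bootstrap resamples, and variables kept when they raise the average stability.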

References

DAVIES, D.L. and BOULDIN, D.W. (1979): A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1 (2), 224–227.
HUBERT, L.J. and ARABIE, P. (1985): Comparing Partitions. Journal of Classification, 2, 193–218.
MUCHA, H.-J. and BARTEL, H.-G. (2014): Soft Bootstrapping in Cluster Analysis and Its Comparison with Other Resampling Methods. In: Spiliopoulou, M., Schmidt-Thieme, L. and Janning, R. (Eds.): Data Analysis, Machine Learning and Knowledge Discovery. Springer, Berlin, 97–104.

Keywords

CLUSTERING, VARIABLE SELECTION, BOOTSTRAPPING, RAND’S INDEX

Bayesian clustering of functional data in the presence of covariates

Damien Juery1, Christophe Abraham1, and Bénédicte Fontez1

UMR MISTEA, Montpellier SupAgro, 2 place Pierre Viala, 34060 Montpellier cedex 2 [email protected]

Abstract. This paper focuses on Bayesian unsupervised learning of curves. The subject matter consists of generalizing a typical clustering model, based on the Dirichlet process, to the functional case. An additional stage is also introduced in order to take covariates into account. Contrary to other methods, which use the finite dimension by either projecting the curves onto a basis or regarding them as multivariate data, the specificity of this work lies in the calculations, which are carried out throughout in the infinite dimension. Indeed, Reproducing Kernel Hilbert Space (RKHS) theory allows us to derive, in the infinite dimension, probability density functions of curves with respect to a reference measure. In this paper, the reference measure is shown to be a Gaussian measure. Likewise, it allows formulae for posterior distributions to be derived, given entire curves and not only discretized data. This paper finally leads to the generalization of the so-called Gibbs with Auxiliary Parameters algorithm (Neal, 2000) within the functional framework. Some results obtained from various applications are presented, and performance is compared to that of other methods.
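The Dirichlet-process clustering prior underlying such models can be sketched via its Chinese restaurant process representation. This is a prior draw only, not the paper's functional Gibbs sampler; the values of n and alpha are arbitrary.

```python
import random
from collections import Counter

random.seed(5)

def chinese_restaurant_process(n, alpha):
    """Draw a random partition of n items from a Dirichlet process
    prior: item i joins an existing cluster with probability
    proportional to its current size, or opens a new cluster with
    probability proportional to alpha (total mass i + alpha)."""
    assignments = [0]
    counts = Counter({0: 1})
    for i in range(1, n):
        r = random.random() * (i + alpha)
        acc = 0.0
        for label, c in counts.items():
            acc += c
            if r < acc:                 # join an occupied cluster
                counts[label] += 1
                assignments.append(label)
                break
        else:                           # open a new cluster
            new = len(counts)
            counts[new] = 1
            assignments.append(new)
    return assignments

parts = chinese_restaurant_process(100, alpha=2.0)
n_clusters = len(set(parts))
```

The number of clusters grows only logarithmically with n, which is why Dirichlet-process models sidestep fixing the number of clusters in advance.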

References

NEAL, R. M. (2000): Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics, 9, 249–265.

Keywords

BAYESIAN STATISTICS, FUNCTIONAL DATA, NONPARAMETRIC CLUSTERING, DIRICHLET PROCESS

K-mode Clustering with Dimensional Reduction for Categorical Data

Kensuke Tanioka1 and Hiroshi Yadohisa2

1 Graduate School of Culture and Information Science, Doshisha University, 1-3 Tatara Miyakodani, Kyotanabe-shi, Kyoto-fu 610-0394. [email protected] 2 Department of Culture and Information Science, Doshisha University, 1-3 Tatara Miyakodani, Kyotanabe-shi, Kyoto-fu 610-0394. [email protected]

Abstract. In various social science fields, observed multivariate data is often categorical. One method to detect structures in such data is Multiple Correspondence Analysis (MCA). However, it is difficult to interpret the results when the data is large. To overcome this problem, Hwang et al. (2006) proposed an MCA in which objects are classified simultaneously. This method provides a way to interpret each cluster easily. However, three problems remain. First, when multivariate categorical data includes noise variables which do not contain clustering structure, the result can be affected by the noise because all variables are treated in the same way. Second, the method assumes that the same variables carry the clustering structure for all clusters; this assumption is not natural, and it is more natural to assume that each cluster is embedded in a different subspace. Finally, when there are many variables in the given data, the result cannot be interpreted visually. Therefore, in this paper, we propose a new method for categorical data based on a subspace method and k-mode clustering (Huang, 1998) to overcome these problems.
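The k-mode building block (Huang, 1998) can be sketched in a few lines, with attribute-wise modes as centroids and simple matching as the dissimilarity. The toy records and the deterministic initialization are invented for the example.

```python
from collections import Counter

def matching_dissim(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(records, k, iters=10):
    """Minimal k-modes: assign records to the nearest mode under
    simple matching, then recompute attribute-wise modes."""
    # deterministic init for the example (assumes k == 2): first/last records
    modes = [list(records[0]), list(records[-1])]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for r in records:
            j = min(range(k), key=lambda j: matching_dissim(r, modes[j]))
            clusters[j].append(r)
        for j, members in enumerate(clusters):
            if members:
                modes[j] = [Counter(col).most_common(1)[0][0]
                            for col in zip(*members)]
    return modes, clusters

data = [("red", "yes", "low"), ("red", "yes", "mid"),
        ("red", "no", "low"), ("blue", "no", "high"),
        ("blue", "no", "mid"), ("blue", "yes", "high")]
modes, clusters = k_modes(data, k=2)
```

The proposed method would additionally restrict each cluster's dissimilarity to its own subspace of variables; the plain version above compares all attributes equally.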

References

HUANG, Z. (1998): Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery, 2, 283–304.
HWANG, H., DILLON, W.R. and TAKANE, Y. (2006): An Extension of Multiple Correspondence Analysis for Identifying Heterogeneous Subgroups of Respondents. Psychometrika, 71, 161–171.

Keywords

SUBSPACE CLUSTERING, VARIABLE SELECTION


Part V

LIS’2014 Workshop

35 Workshop on Classification and Subject Indexing in Library and Information Science (LIS’2014)

Organized by Frank Scholze

Day 1:

Wednesday, July 2, 2014: 14:00 - 17:30, IRC Seminar Room

Collaborative Literature Work in the Scientific & Educational Publication Process: The Cogeneration of Citation Networks

Leon Burkard1 and Andreas Geyer-Schulz1

Karlsruhe Institute of Technology, Institute of Information Systems and Marketing, Chair of Information Services and Electronic Markets, Kaiserstraße 12, D-76131 Karlsruhe {leon.burkard, andreas.geyer-schulz}@kit.edu

Abstract. In educational & scientific publishing processes, scientists and prospective scientists (students) in their different roles (author, editor, reviewer, production editor, lector, reference librarian) invest a large amount of work into the proper handling of scientific literature in the widest sense. In this contribution we introduce the LitObject-Middleware and its combination with the popular open-source tool Zotero. The LitObject-Middleware enables scientists to exchange literature objects consisting of bibliographic references and documents (e.g. PDF documents). In our contribution we emphasize several process improvements, with a special focus on the cogeneration of citation networks.

References

GLÄNZEL, W. and SCHUBERT, A. (2004): Analyzing Scientific Networks Through Co-Authorship. In: Moed, H.F., Glänzel, W. and Schmoch, U. (Eds.): Handbook of Quantitative Science and Technology Research. Kluwer.
HENSLEY, M.K. (2011): Citation Management Software: Features and Futures. Reference & User Services Quarterly, 50(3), 204–208.
LEUNG, S.-W. and PENG, Y. (2000): Building Web-enabled bibliography databases for collaborative research by using open source software tools. Program, 34(3), 291–296.
LOCKEMANN, P., CHRISTOFFEL, M., PULKOWSKI, S. and SCHMITT, B. (2000): UniCats: ein System zum Beherrschen der Dienstevielfalt im Bereich der wissenschaftlichen Literaturrecherche. It – Information Technology, 42(6), 34–40.

Keywords

Literature Management, Collaborative Work, Interoperability, Data collection, Citation Networks

202 Context Analysis and Context Indexing: Formal Pragmatics in Knowledge Organization

Michael Kleineberg

Berlin School of Library and Information Science, Humboldt-Universität zu Berlin, Dorotheenstr. 26, 10117 Berlin, Germany [email protected]

Abstract. In the field of library and information science (LIS), bibliographic metadata or information resource descriptions are limited either to descriptive indexing oriented on syntactics or to subject indexing oriented on semantics. In face of the challenge of an increasing cross-disciplinary and intercultural knowledge exchange, this paper proposes a complementary kind of indexing based on pragmatics in order to describe the context of meaning production in terms of viewpoints adopted and methods applied by authors or creators of documents. According to Jürgen Habermas’ formal pragmatics of communication, there are two modes of meaning explication. The first one corresponds to traditional subject analysis and is concerned with the semantic content of a symbolic formation or the author’s explicit know-that. In contrast, the second mode seeks to reconstruct the underlying generative structure or the author’s implicit know-how. This means the historically and socially situated epistemic framework that determines the rules according to which a symbolic formation is produced. Thus, the task of context analysis is to make this implicit rule consciousness explicit by using the methodology of rational reconstruction as it has been developed in reconstructive sciences such as cognitive psychology, cultural anthropology, sociology, or linguistics. Complementary to empirical pragmatics, which is related to the specific use of language within practice and discourse communities, the decisive advantage of formal pragmatics is to take the universal conditions of communication and mutual understanding into account by means of a comprehensive classification of speech acts, validity spheres, and levels of communicative competence. Therefore, universal or formal pragmatics provides a theoretical foundation for context analysis and context indexing in knowledge organization. As an analytical tool it enables the indexer not only to identify particular epistemic contexts but most notably to interrelate different viewpoints and different methods across knowledge domains, disciplines, or cultures based on a systematic organization of generative structures. An example of the proposed context indexing will be demonstrated on the basis of recent LIS literature which is concerned with the same subject matter approached from multiple perspectives, namely with the nature of human knowledge.

References

BEGHTOL, C. (1998): Knowledge Domains: Multidisciplinarity and Bibliographic Classification Systems. Knowledge Organization, 25 (1/2), 1–12.
BIAGETTI, M.T. (2006): Indexing and Scientific Research Needs. In: G. Budin, C. Swertz and K. Mitgutsch (Eds.): Knowledge Organization for a Global Learning Society: Proceedings of the Ninth International ISKO Conference, 4–7 July 2006, Vienna, Austria, Vol. 10. Ergon, Würzburg, 241–246.
BIES, W. (1992): Linguistische Pragmatik: Eine vernachlässigte Referenzdisziplin der Inhaltserschließung. In: W. Gödert (Ed.): Kognitive Ansätze zum Ordnen und Darstellen von Wissen. Indeks, Frankfurt am Main, 207–216.
FUGMANN, R. (1993): Subject Analysis and Indexing: Theoretical Foundation and Practical Advice. Ergon, Würzburg.
GNOLI, C. (2011): Animals Belonging to the Emperor: Enabling Viewpoint Warrant in Classification. In: P. Landry et al. (Eds.): Subject Access: Preparing for the Future. De Gruyter, Berlin, 91–100.
HABERMAS, J. (1998): On the Pragmatics of Communication. MIT Press, Cambridge.
HJØRLAND, B. (1997): Information Seeking and Subject Representation: An Activity-Theoretical Approach to Information Science. Greenwood, Westport.
KAIPAINEN, M. and HAUTAMÄKI, A. (2011): Epistemic Pluralism and Multi-Perspective Knowledge Organization: Explorative Conceptualization of Topical Content Domains. Knowledge Organization, 38 (6), 503–514.
KLEINEBERG, M. (2013): The Blind Men and the Elephant: Towards an Organization of Epistemic Contexts. Knowledge Organization, 40 (5), 340–362.
MAI, J.-E. (2001): Semiotics and Indexing: An Analysis of the Subject Indexing Process. Journal of Documentation, 57 (5), 591–622.
SZOSTAK, R. (2004): Classifying Science: Phenomena, Data, Theory, Method, Practice. Springer, Dordrecht.
THELLEFSEN, T.L. and THELLEFSEN, M.M. (2004): Pragmatic Semiotics and Knowledge Organization. Knowledge Organization, 31 (3), 177–187.
WEINBERG, B.H. (1988): Why Indexing Fails the Researcher. The Indexer, 16, 3–6.

Keywords

KNOWLEDGE ORGANIZATION, CONTEXT INDEXING, UNIVERSAL PRAGMATICS, RATIONAL RECONSTRUCTION

PubSim: A graph-based classification recommendation system for mathematical publications

Susanne Gottwald and Thorsten Koch

Zuse Institute Berlin [gottwald|koch]@zib.de

Abstract. We present a system to automatically classify mathematical publications based on metadata graph analysis.

The metadata of a publication are related to the metadata of other publications: an author may have several publications, citations link related papers to each other, and a shared journal gives indications of common subjects. We apply this concept to the complete metadata and, using existing data on one million publications, build a graph with more than 15 million nodes describing the relations between the articles.

Using this graph we build PubSim, a system that calculates the most similar publications. Based on a proximity measure, PubSim automatically recommends probable classes for unclassified publications. The system is classification-independent and can be applied with interdisciplinary classifications.
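The general idea can be sketched as follows. This is a minimal illustration only: the toy graph, the Jaccard proximity measure, and the class labels are invented here and are not the measure or data actually used by PubSim.

```python
from collections import Counter

# Toy metadata graph: each publication is linked to metadata nodes
# (authors, journals, cited papers). All identifiers are hypothetical.
graph = {
    "pub1": {"author:A", "journal:J1", "cites:pub3"},
    "pub2": {"author:A", "journal:J1"},
    "pub3": {"author:B", "journal:J2"},
    "pub4": {"author:B", "journal:J2", "cites:pub1"},
}
classes = {"pub1": "05C", "pub3": "90B", "pub4": "90B"}  # known class labels

def proximity(p, q):
    """Jaccard overlap of metadata neighbourhoods as a simple proximity measure."""
    a, b = graph[p], graph[q]
    return len(a & b) / len(a | b)

def recommend(p, k=1):
    """Recommend a class for p by majority vote among its k most similar
    already-classified publications."""
    ranked = sorted((q for q in classes if q != p),
                    key=lambda q: proximity(p, q), reverse=True)
    votes = Counter(classes[q] for q in ranked[:k])
    return votes.most_common(1)[0][0]

print(recommend("pub2"))  # pub2 shares author and journal with pub1 -> "05C"
```

A production system would of course replace the Jaccard overlap with a proximity measure suited to a 15-million-node graph, but the recommend-by-nearest-classified-neighbours pattern stays the same.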

References

HUANG, Z. and CHUNG, W. (2002): A Graph-based Recommender System for Digital Library. In: Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, New York, 65–73.
PRICE, D.J. de Solla (1965): Networks of Scientific Papers: The pattern of bibliographic references indicates the nature of the scientific research front. Science, 149 (3683), 510–515.
BANI-AHMAD, S. and AL-HAMDANI, A. (2005): Evaluating Publication Similarity Measures. IEEE Data Engineering Bulletin, 28 (4), 21–28.

Keywords

CLASSIFICATION, RECOMMENDER SYSTEMS, BIG DATA

Subject indexing for author disambiguation - opportunities and challenges

Cornelia Hedeler1, Andreas Oskar Kempf2, and Jan Steinberg2

1 School of Computer Science, University of Manchester, UK [email protected] 2 GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany [email protected], [email protected]

Abstract. Author disambiguation is becoming more important due to the increased availability of publications in digital libraries. Various approaches to author disambiguation exist [1], utilising a variety of information, e.g., author name, affiliation, title, journal and conference name, and co-author, citation, and topic information [1]. Topics can be obtained, e.g., from subject information captured in the various controlled vocabularies, classifications, and mappings between them that are used to index publications [2]. The research interests of authors, evident in topics, may change over time, though [1], which limits their usefulness for author disambiguation. Here we present a longitudinal analysis of topics with respect to their suitability for author disambiguation. We analyse the distribution of subject headings taken from the Thesaurus and the Classification for the Social Sciences (TheSoz) for research projects and literature (available in the portal sowiport, maintained by GESIS) and the changes in this distribution over time. To assess the suitability of subject information for author disambiguation more closely, we then analyse the changes in annotation over time for a representative selection of authors, author groups at different stages in their careers, and different document types, also taking into account the hierarchical organisation of the applied controlled vocabularies.
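The kind of longitudinal comparison described above can be sketched in a few lines. All headings and years below are invented for illustration; the overlap measure is a generic choice, not necessarily the one used in the study.

```python
from collections import Counter

# Hypothetical subject-heading annotations of one author's publications per year.
annotations = {
    2005: ["migration", "labour market", "migration"],
    2010: ["labour market", "social inequality"],
    2015: ["social inequality", "education", "education"],
}

def distribution(headings):
    """Relative frequency of each subject heading in one time slice."""
    counts = Counter(headings)
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}

def overlap(year_a, year_b):
    """Shared probability mass between two yearly distributions
    (1.0 = identical interests; low values signal topical drift)."""
    da = distribution(annotations[year_a])
    db = distribution(annotations[year_b])
    return sum(min(da.get(h, 0), db.get(h, 0)) for h in set(da) | set(db))

print(round(overlap(2005, 2010), 3))  # 0.333 -> partial topical continuity
print(round(overlap(2005, 2015), 3))  # 0.0   -> interests have shifted entirely
```

Drops in this overlap over an author's career are exactly the cases where topic information alone would mislead a disambiguation algorithm.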

References

1. FERREIRA, A.A. et al. (2012): A brief survey of automatic methods for author name disambiguation. SIGMOD Record, 41 (2), 15–26.
2. TORVIK, V.I. et al. (2005): A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56 (2), 140–158.

Keywords

SUBJECT INDEXING, AUTHOR DISAMBIGUATION, TOPICS

TheSoz: http://www.gesis.org/en/services/research/thesauri-und-klassifikationen/
sowiport: http://sowiport.gesis.org

Day 2:

Thursday, July 3, 2014: 11:00 - 16:30, IRC Seminar Room

Bibliographic Report 2013: A choice of relevant decimal classification literature // Online Report 2013: A choice of interesting web-features for classifications

Bernd Lorenz1

Fachhochschule für öffentliche Verwaltung und Rechtspflege, München

Abstract. Bibliographic Report 2013: A choice of relevant decimal classification literature

DDC:
Jahns, Yvonne / Karg, Helga: Inhaltserschließung juristischer Literatur in der Deutschen Nationalbibliothek (= Recht. Bibliothek. Dokumentation 43, 2013, pp. 14–29) (DDC pp. 17–21; p. 20: "A long-standing desideratum in the German library landscape is a concordance between the RVK and the DDC.")
Mai, Jens-Erik: Ethics, Values and Morality in Contemporary Library Classifications (= Knowledge Organization 40, 2013, pp. 242–253) (DDC 21, pp. 247 f., "Racial groups")
Eberhardt, Joachim: reports on the Swedish National Library's decision "to switch from the Swedish classification SAB to the DDC" (= ABI Technik 33, 2013, p. 64)

RVK (Regensburg Network Classification):
Bastian, Stefan: Vorfahrt für die elektronischen Medien (= b.i.t.online, Sonderheft BIX 2013, pp. 37–40) (p. 39: at RWTH Aachen, "electronic books and journals are consistently subject-indexed with the Regensburger Verbundklassifikation (RVK). The RVK is used as the uniform indexing instrument for all subjects of the faculties (except medicine) in the open-stack areas. ... The holdings of the medical library, however, are shelved according to the classification of the National Library of Medicine (NLM). For all media, additional subject indexing is taken over from external sources, including Dewey notations and RSWK subject headings.")
Franke, Michael: Spielerisch Bücher platzieren. Eine Software für visuelles Belegungsmanagement (= b.i.t.online 16, 2013, pp. 118–123) (p. 119: "The integration of the 24 participating libraries (of the FU Berlin) is not only organisational or spatial. The 1.1 million media, which at the start of the project in 2007 were shelved under 32 in-house classifications, are currently being reclassified using the Regensburger Verbundklassifikation (RVK)" ...)

Classification overview:
Desale, Sanjay K. / Kumbhar, Rajendra M.: Research on Automatic Classification of Documents in Library Environment: A Literature Review (= Knowledge Organization 40, 2013, pp. 295–304)


Oberhauser, Otto: Inhaltliche Erschliessung im Verbund: Die aktuelle Situation in Österreich (= Mitteilungen der VÖB 66, 2013, pp. 231–249), of which pp. 237–247: 3. Klassifikatorische Sacherschließung (covering BK, DDC, RVK; MSC, ZDB); pp. 238 f.: concordance projects concerning the RVK: social work (from Germany) to BK (2011), political science to BK (2011), German studies to BK (2012–2013); pp. 242–244: RVK; p. 242: "In the years since 2000 the RVK has developed into the leading shelf classification in Austria's libraries (universities, universities of applied sciences)."

Related classifications:
Cooperative Patent Classification (CPC):

Geiß, Dieter: Neue Wege der Patent- und Markeninformation (= IWP 64, 2013, pp. 216–227) (pp. 222 f.: a fairly detailed presentation of the CPC)

LCC:
Murphy, Julie / Long, Dallas / MacDonald, Jean B.: Students' Understanding of the Library of Congress Call Number System (= Reference Librarian 54, 2013, pp. 103–117)

PanThema:
Semantische Ordnung. PanThema is the name of a new book classification system whose refined concept hierarchy complements the product group classification (Warengruppensystematik); pilot operation starts in January 2014. (= börsenblatt, 15 August 2013, no. 180)

Online Report 2013: A choice of interesting web-features for classifications

Michael Franke

Freie Universität Berlin, Universitätsbibliothek

Abstract. Short presentations of websites, catalogues, and other web features that contain classificatory elements, in the sense of a best-practice report. Since I am contributing to Mr Lorenz's session for the first time, this will initially be on a smaller scale. Examples: Basel Register of Thesauri, Ontologies & Classifications, http://www.bartoc.org/. Further examples will follow.

Classification systems in German public libraries. An overview of the status quo and its application.

Frank Seeger

ekz.bibliotheksservice GmbH, Reutlingen, Germany, [email protected]

Abstract. Four main classification systems are currently in use in public libraries across Germany: ASB, KAB, SfB and SSD. They were developed and adopted at different times. Today, the organisation and continuous maintenance of the SfB can be regarded as the best approach to keeping a classification system permanently up to date. With this in mind, the ASB and KAB have adopted nearly the same workflow and are likewise published in a wiki. This talk gives an overview of the status quo, the specific characteristics, and the further development of these shelf classification systems.

Storing and Analyzing Bibliographic Metadata with ElasticSearch

Clemens Düpmeier

Institute for Applied Computer Science (IAI), Karlsruhe Institute of Technology (KIT), Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen [email protected]

Abstract. The openTA project is a research project funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) with the goal of establishing a service-oriented communication and information infrastructure and a web portal as the central information harvesting, aggregation and access platform for the Technology Assessment community (NTA network) in the DACH (Germany, Austria, Switzerland) area. The openTA IT implementation focuses on the concept of service orientation and the idea that TA- and NTA-network-relevant information should be collected from NTA partners and other sources using standardized formats and service APIs covering the most important information types (i.e. news, publications, events and calendar information). Additionally, information will also be harvested by crawling organizational web pages and analyzing metadata and microdata formats within these pages. Instead of storing the resulting metadata in relational databases, the openTA harvesting infrastructure converts incoming data into JSON (JavaScript Object Notation) objects, which are then analyzed, de-duplicated, grouped and finally "linked" to other information objects (e.g. those representing institutional or personal NTA members) or other resources. This results in extended JSON-LD (JSON Linked Data) objects, which are then stored in an ElasticSearch server infrastructure, up to and including the original source data, for further use. ElasticSearch is a modern implementation of a structured search index server (like Solr) which natively stores JSON-format objects (including JSON-LD objects) in an underlying scalable NoSQL database infrastructure without the need to formally declare a database schema or the restriction that incoming objects must strictly adhere to a declared schema.
This makes it an ideal candidate for storing large quantities of information objects with very different internal structures and quality on the fly while harvesting information. As an environment for searching and indexing structured data, ElasticSearch also provides sophisticated functionalities for comparing incoming objects with already stored objects (e.g. for de-duplication, harmonization and enhancement of data) and for analyzing and accessing already stored information objects in real time using the ElasticSearch Query API. This talk will give a short introduction to the basic functionalities of the ElasticSearch server infrastructure and will then focus on how the openTA project uses an ElasticSearch server and other open-source data frameworks for harvesting, storing, processing and accessing bibliographic data relevant to the NTA community. The openTA approach will be compared to other similar approaches, and the advantages and limitations of the current implementation will be discussed. Finally, an outlook on future work will be given.
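The normalise-then-deduplicate step of such a harvesting pipeline can be sketched as follows. All field names, the @context URL, the identifier scheme, and the example records are invented for this sketch and are not openTA's actual schema.

```python
import json

# Two hypothetical harvested records describing the same publication,
# arriving from different partners in slightly different shapes.
incoming = [
    {"title": "TA Methods ", "doi": "10.1000/xyz", "source": "partner-a"},
    {"title": "TA Methods", "doi": "10.1000/XYZ", "source": "partner-b"},
]

def to_jsonld(record):
    """Normalise a harvested record into a JSON-LD-style object
    with a canonical identifier usable as a de-duplication key."""
    return {
        "@context": "http://schema.org",
        "@type": "ScholarlyArticle",
        "@id": "doi:" + record["doi"].strip().lower(),
        "name": record["title"].strip(),
        "provenance": [record["source"]],
    }

def deduplicate(objects):
    """Merge objects sharing an @id, concatenating their provenance lists."""
    merged = {}
    for obj in objects:
        if obj["@id"] in merged:
            merged[obj["@id"]]["provenance"] += obj["provenance"]
        else:
            merged[obj["@id"]] = obj
    return list(merged.values())

docs = deduplicate(to_jsonld(r) for r in incoming)
print(json.dumps(docs, indent=2))
# A real pipeline would then index each merged object into ElasticSearch,
# e.g. with the official Python client:
#   es.index(index="publications", id=doc["@id"], document=doc)
```

Because ElasticSearch imposes no fixed schema, the merged objects can carry arbitrarily different extra fields from each source without any migration step.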

Subject Indexing of Textbooks - Challenges in the Construction of a Discovery System

Esther Chen, Jessica Drechsler, Bianca Pramann and Robert Strötgen

Georg Eckert Institute for International Textbook Research {chen|drechsler|pramann|stroetgen}@gei.de

Abstract. The library of the Georg Eckert Institute (GEI) contains one of the most comprehensive collections of international textbooks in the world, attracting many domestic and foreign researchers due to the exceptional quality of its cataloguing and indexing. Such a specialised collection presents very particular challenges in terms of subject indexing, a process which is of paramount importance for textbooks as traditional search parameters are seldom relevant. The fact that very few institutions comprehensively catalogue textbook content presents an additional obstacle, as there is no suitable, existing classification system to replicate. The GEI currently uses a classification system of its own design. A strong desire for increased standardisation and better compatibility, sustainability and expandability of textbook classification drove the search for comparable standards for federal states/regions, levels of education and subjects. After reviewing diverse well-proven classification systems, we came to the conclusion that no one classification system could be implemented within the GEI that would satisfactorily fulfil all requirements in equal measure. Therefore we broke down the process of content indexing and assembled a process using appropriate components from a range of available classification systems. The decision to implement different classification systems enables textbook and curricula content to be catalogued and indexed in a bespoke, yet still standardised process. Using this standardised classification system, a structured search using parameters such as country of validity, subject and education level can be conducted in the Curricula Workstation, a curricula information system, and GEI-DZS, a database containing textbook approval information. The results produced by using these procedures for subject indexing are of particular value for the display of the textbook collection in the GEI's Discovery System.
The ability to restrict the search according to country of validity, subject and education level remedies an urgent user desideratum by providing a simplified search process, one which could not be implemented in a traditional OPAC system.
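The faceted restriction described above amounts to filtering records on several classification facets at once. The sketch below is purely illustrative; the records, facet names and values are invented and do not reflect the GEI's actual data model.

```python
# Minimal faceted-search sketch over textbook records.
records = [
    {"title": "Geschichte 7", "country": "DE", "subject": "history", "level": "secondary"},
    {"title": "Histoire 5e", "country": "FR", "subject": "history", "level": "secondary"},
    {"title": "Biologie 9", "country": "DE", "subject": "biology", "level": "secondary"},
]

def search(**facets):
    """Return records matching every given facet
    (e.g. country of validity, subject, education level)."""
    return [r for r in records
            if all(r.get(k) == v for k, v in facets.items())]

hits = search(country="DE", subject="history")
print([r["title"] for r in hits])  # ['Geschichte 7']
```

The point of the multi-system classification process is precisely that each facet can come from a different, standardised vocabulary while the combined filter stays this simple.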

Keywords

Research library, subject indexing, classification system, information system, discovery system

The ofness and aboutness of survey questions: improved indexing of social science data

Tanja Friedrich1 and Dr. Pascal Siegers2

1 GESIS, Cologne, Germany [email protected] 2 GESIS, Cologne, Germany [email protected]

Abstract. The data-intensive research paradigm (Hey et al. 2009) and our consequent endeavours to foster data sharing require enhanced data documentation and, in particular, improved subject indexing. Adopting a perspective of user-centred indexing, in this paper we propose a concept for subject indexing of data for the case of social science survey data. Secondary survey users are looking for individual data that refer to specific variables of interest for their own research. Thus the primary information source for survey indexing should be the questionnaire, since the variables or constructs are represented in the questions that have been asked. However, due to operationalization processes in survey development, the studied constructs are mostly hidden in the verbalization (latent subject content). Indexable concepts are therefore to be found on two different semantic levels, which we, inspired by research on the indexing of images (Shatford 1986), treat as the ofness and aboutness of survey data. This approach results in a great quantity of indexing terms per survey, leading to false associations and low precision in retrieval, because questions related to the same construct are indexed by different keywords (e.g. worshipping vs. church attendance). Therefore we apply a syntax of term linking and role operators, combining directive terms (e.g. attitude, experience, perception) with subject terms (e.g. corruption, homosexuality). Each directive and subject term combination represents a measurable unit of interest to the secondary researcher. In order to provide satisfactory recall as well, we use a social science thesaurus that allows us to control for synonyms and term relationships.
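The term-linking idea can be sketched as follows. The synonym table and the linking syntax below are invented for illustration; the actual work uses a full social science thesaurus and its own role operators.

```python
# A tiny stand-in for the thesaurus: synonyms map to one preferred subject term.
THESAURUS = {
    "worshipping": "church attendance",
    "church attendance": "church attendance",
    "corruption": "corruption",
    "homosexuality": "homosexuality",
}

def index_question(directive, subject):
    """Combine a directive term with a thesaurus-controlled subject term
    into one indexable unit."""
    preferred = THESAURUS[subject]
    return f"{directive} / {preferred}"

# Two differently worded questions about the same construct now share one unit:
print(index_question("frequency", "worshipping"))        # frequency / church attendance
print(index_question("frequency", "church attendance"))  # frequency / church attendance
print(index_question("attitude", "homosexuality"))       # attitude / homosexuality
```

Synonym control raises recall (both wordings retrieve the same unit), while the directive/subject pairing keeps precision by distinguishing, say, attitudes towards a topic from experiences of it.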

References

HEY, T., TANSLEY, S. and TOLLE, K. (Eds.) (2009): The Fourth Paradigm. Data-Intensive Scientific Discovery. Microsoft Research, Redmond.
SHATFORD, S. (1986): Analyzing the Subject of a Picture: A Theoretical Approach. Cataloging and Classification Quarterly, 6 (3), 39–62.

Keywords

DATA SHARING, DATA RETRIEVAL, SUBJECT INDEXING

Front Cover: Illustration by Käthe Wenzel