Analyzing and Visualizing Spreadsheets
Total Page:16
File Type:pdf, Size:1020Kb
Analyzing and visualizing spreadsheets Analyzing and visualizing spreadsheets PROEFSCHRIFT ter verkrijging van de graad van doctor aan de Technische Universiteit Delft, op gezag van de Rector Magnificus prof. ir. K.C.A.M. Luyben, voorzitter van het College voor Promoties, in het openbaar te verdedigen op woensdag 23 januari om 10:00 uur door F´elienneFrederieke Johanna HERMANS ingenieur in de informatica geboren te Oudenbosch. Dit proefschrift is goedgekeurd door de promotoren: Prof. dr. A. van Deursen Co-promotor: Dr. M. Pinzger Samenstelling promotiecommissie: Rector Magnificus voorzitter Prof. dr. A. van Deursen Technische Universiteit Delft, promotor Dr. M. Pinzger Technische Universiteit Delft, co-promotor Prof. dr. E. Meijer Technische Universiteit Delft & Microsoft Prof. dr. H. de Ridder Technische Universiteit Delft Prof. dr. M. Burnett Oregon State University Prof. dr. M. Chaudron Chalmers University of Technology and University of Gothenburg Dr. D. Dig University of Illinois at Urbana-Champaign Copyright c 2012 by Felienne Hermans All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without the prior permission of the author. Author email: [email protected] Acknowledgements First of all, I would like to thank Arie. Thanks for always supporting my decisions, for being there when times were a bit tough and for loads of fun and inspiration. Martin, thanks for the gazillion spelling errors you corrected in all my papers, your patience was amazing. Stefan Jager, thanks for trusting me to experiment freely at Robeco! You allowed me to really test my ideas in practice, even when they were very preliminary. Robert, thanks for several kilos of home made fries and for always, always laughing at my stupid jokes. Anja, I sincerely enjoyed all the time we spent `together', I hope there is a lot more to come. The two of you are the best paranymphs a girl could wish for. And Maartje, thanks for the awesome design of my cover. Finally, I would like to thank Jan Karel Pieterse for his extensive proof reading of a draft of this dissertation. Without his eye for detail it would have, without a doubt, contained numerous more mistakes than it does now. Delft, Felienne Hermans September 2012 i ii ACKNOWLEDGEMENTS Contents Acknowledgements i 1 Introduction 1 1.1 Spreadsheets . 1 1.2 A brief history of spreadsheets . 1 1.3 Motivation . 5 1.4 Related work . 6 1.4.1 Taxonomy of spreadsheet errors . 6 1.4.2 Automated error detection . 7 1.4.3 Spreadsheet testing . 7 1.4.4 Refactoring . 8 1.4.5 Finding high-level structures . 8 1.4.6 Modeling spreadsheets . 9 1.5 Research questions . 9 1.6 Methodology . 10 1.7 Background of this dissertation . 11 2 Extracting Class Diagrams from Spreadsheets 15 2.1 Introduction . 15 2.2 Background . 17 2.2.1 Cell types . 17 2.2.2 Pattern languages for two-dimensional languages . 18 2.2.3 Pattern grammars . 20 2.3 The Gyro approach to spreadsheet reverse engineering . 20 2.4 Pattern recognition . 21 2.4.1 Overview . 21 iii iv CONTENTS 2.4.2 Finding bounding boxes . 22 2.4.3 Cell classification . 22 2.4.4 Normalization . 23 2.4.5 Filtering . 23 2.4.6 Parsing . 23 2.5 From patterns to class diagrams . 24 2.5.1 Using annotations . 24 2.5.2 Class diagrams . 27 2.5.3 Enrichment . 28 2.6 Spreadsheet patterns . 29 2.6.1 Simple class . 29 2.6.2 Simple class including name . 30 2.6.3 Simple class including methods . 30 2.6.4 Aggregation . 30 2.6.5 Associated data . 31 2.7 Implementation . 32 2.8 Evaluation . 32 2.8.1 The data set . 32 2.8.2 Quality of chosen patterns . 33 2.8.3 Quality of mapping . 33 2.9 Discussion . 37 2.9.1 Spreadsheet limitations . 37 2.9.2 Class diagram limitations . 37 2.9.3 Beyond class diagrams . 37 2.9.4 Dealing with spreadsheets errors . 38 2.9.5 Meaningful identifiers . 38 2.9.6 Threats to validity . 38 2.10 Related work . 39 2.11 Concluding remarks . 40 3 Supporting Professional Spreadsheet Users by Generating Leveled Data- flow Diagrams 43 3.1 Introduction . 43 3.2 Spreadsheet information needs . 44 3.3 Background . 47 3.3.1 Dataflow diagrams . 47 3.3.2 Leveled dataflow diagrams . 47 3.4 Dataflow diagram extraction algorithm . 48 3.4.1 Cell classification . 48 3.4.2 Identifying data blocks . 49 3.4.3 Name resolution . 49 3.4.4 Initial dataflow diagram construction . 51 CONTENTS v 3.4.5 Name replacement . 51 3.4.6 Grouping . 51 3.5 Dataflow views . 52 3.5.1 Global view . 53 3.5.2 Worksheet view . 54 3.5.3 Formula view . 54 3.6 Implementation . 54 3.7 Evaluation . 54 3.7.1 Interviews . 55 3.7.2 Case Studies . 56 3.7.3 Conclusions . 59 3.8 Discussion . 60 3.8.1 Meaningful identifiers . 61 3.8.2 Spreadsheet comprehension . 61 3.8.3 Threats to validity . 61 3.9 Related work . 62 3.10 Concluding remarks . 63 4 Detecting and Visualizing Inter-worksheet Smells in Spreadsheets 65 4.1 Introduction . 65 4.2 Related Work . 66 4.3 Background & motivating example . 67 4.4 Inter-worksheet smells . 69 4.4.1 Inappropriate Intimacy . 69 4.4.2 Feature Envy . 70 4.4.3 Middle Man . 70 4.4.4 Shotgun Surgery . 70 4.5 Detecting inter-worksheet smells . 71 4.5.1 Inappropriate Intimacy . 71 4.5.2 Feature Envy . 72 4.5.3 Middle Man . 72 4.5.4 Shotgun Surgery . 72 4.5.5 Determining the thresholds . 74 4.6 Visualizing inter-worksheet smells . 75 4.7 Implementation . 75 4.8 Evaluation . 77 4.8.1 Inter-worksheet smells in the EUSES Corpus . 77 4.8.2 Inter-worksheet smells in ten real-life case studies . 80 4.8.3 Conclusions . 84 4.9 Discussion . 85 4.9.1 VBA code, pivot tables and charts . 85 vi CONTENTS 4.9.2 Mental model of a spreadsheet . 85 4.9.3 Understanding spreadsheet design decisions . 85 4.9.4 Threats to validity . 86 4.10 Concluding remarks . 86 5 Detecting Code Smells in Spreadsheet Formulas 89 5.1 Introduction . 89 5.2 Formula smells . 90 5.2.1 Multiple Operations . 90 5.2.2 Multiple References . 90 5.2.3 Conditional Complexity . 91 5.2.4 Long Calculation Chain . 91 5.2.5 Duplicated Formulas . 91 5.3 Formula metrics . 92 5.4 Determining smell thresholds . 93 5.5 Risk maps . ..