Yxilon – Designing The Next Generation, Vertically Integrable Statistical Software Environment∗

Wolfgang Härdle¹,², Sigbert Klinke² and Uwe Ziegenhagen¹,²
Humboldt-Universität zu Berlin
¹Center for Applied Statistics and Economics
²Institute for Statistics and Econometrics
{haerdle, sigbert, ziegenhagen}@wiwi.hu-berlin.de

July 31, 2004

Abstract

Modern statistical computing requires the smooth integration of new algorithms and quantitative analysis results into all sorts of platforms such as web browsers, standard and proprietary application software. Common statistical software packages often cannot be adapted to integrate into new environments or simply do not meet the demands of users, especially beginners. With Yxilon we propose a vertically integrable, modular statistical computing environment that provides the user with a rich set of methods and a diversity of interfaces, including a command-line interface, web clients and interactive examples in electronic books. This architecture allows users to rely on a single environment to organize data from a variety of sources, analyse them and visualize or export the results to other software programs. The design of Yxilon is inspired by XploRe, a statistical environment developed by MD*Tech and Humboldt-Universität zu Berlin. Yxilon incorporates several ideas from recent developments and design principles in software engineering: a modular plug-in architecture, platform independence, and the separation of user interfaces and computing engine.
Keywords: Java, Client/Server, XploRe, Yxilon, electronic publishing, e-books

∗ Financial support was received from the Society for Economics and Management at Humboldt-Universität zu Berlin.

1 The Past and Present of Statistical Software

"Each new generation of computers offers us new possibilities, at a time when we are far from using most of the possibilities offered by those already obsolete." John W. Tukey (1965)

Before we discuss the future of statistical software, it is worth looking at past and present tools for a statistician's daily work. Back in the 1960s, when several of today's commercial software packages originated, data analysis was an uphill struggle. Computers were scarce, expensive and, compared with the latest generation one can buy in any supermarket today, extremely slow. Furthermore, these machines had to be operated by specially trained personnel, and computing time even had to be reserved in advance.

When we look at the market for statistics software today, we find a large variety of free or commercial packages such as S-Plus, SPSS, Minitab and R, able to handle large datasets in a split second. What is really astonishing, however, is that the product used for most analysis tasks worldwide is Microsoft Excel, although there are certain doubts about its numerical precision (McCullough and Wilson, 1999). Why is Excel so attractive for so many users? This must have to do with the way we do data analysis. Following Chambers (2000), we can split this job into three subtasks: Organisation, Analysis and Presentation. Data analysis today is done in this sequence. Table 1 points out the pros and cons of Excel in these three subtasks. We see that Excel may be a good choice for small or medium tasks, but it may not be sufficient for a deeper analysis of large datasets. Here "dedicated" statistics packages find their market. The ability to compete against a standard spreadsheet package may be taken as one requirement, but what other requirements does a statistics package need to fulfill?

                 pro                                         contra
  Organisation   large variety of import and export          tables limited to 65,536 rows
                 filters, filter and database functions      and 256 columns
  Analysis       numerous functions for statistical          numerical inaccuracies
                 analysis, e.g. for ANOVA, Fourier
                 analysis, regression and sampling
  Presentation   graphics dynamically connected with data    no statistical displays such as
                                                             boxplots, histograms, etc.

  Table 1: Performance of Excel in Organisation, Analysis and Presentation

This question is discussed in the next section, where we set up a list of 12 essentials for a universal, next-generation statistical computing environment. In Section 3 we argue that vertical integration is the key to successful data analysis, proliferation of methods and report generation. In Section 4 we sketch the design of Yxilon (Yes, XploRe Is Living On), a prototype for a future, vertically integrated statistical computing environment. In the last section we summarize our thoughts. All information may also be found on the project's webpage http://www.quantlet.org.

2 Requirements for Statistical Environments

Chambers and Lang (1999) give a list of essentials that are needed and desirable for statistical tools:

1. Usable from multiple front-ends, i.e. the data analysis must be performable from Excel, web browsers and other user interfaces (see the sketch at the end of this section).
2. Support for the development of GUIs for different audiences. Each discipline has its own culture of naming statistical phenomena; GUIs must reflect this feature.
3. Extensibility at the language/interpreter level and at the native core level, i.e. high-level code must be convertible into high-speed production code.
4. Internet abilities to read and write data over networks.
5. Database support to allow nearly real-time analysis.
6. Interactive graphics and connections to other graphical controls. Teachers and students like to show and study the sensitivity of the implemented statistical methods.
7. Support for multi-processor machines.
8. Extensibility by inclusion of existing code.
9. Optimization for performance.

The view an econometrician has of methods and data differs from the view of a biometrician, although they might use exactly the same statistical techniques. A Japanese statistician has different needs for a supporting help system than, say, a French statistician, so there is a need for more than one language interface. Reports, tutorials and newsgroups are resources that influence the choice of the statistical environment. We therefore add several essentials:

Set of methods: The included set of methods is certainly one of the most crucial points when selecting a statistical engine. Although there is a common subset of methods, most commercial software programs show huge differences in the included method sets, and most producers provide (expensive) add-ons for special analysis tasks.

Multiple language support: English is the lingua franca of science, but for non-native users the usability of the software increases significantly when the graphical user interface, hints and especially error messages are given in their own language.

Valuable user resources: The available user resources, such as manuals in printed and electronic form, tutorials and on-line help, are the access key to the software. While experienced users often need just an index of the available functions, novices and students require more assistance in the form of tutorials and sufficient manuals. These should also provide substantial help on the theoretical background of the available methods, since, as Tukey (1965) stated: "Most uses of the classical tools of statistics have been, are, and will be, made by those who know not what they do."
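To make the first of these requirements more concrete, the following minimal sketch shows how a thin front-end could hand a single command to a separate computing engine over a plain socket, in the spirit of the client/server separation of user interfaces and computing engine described in the abstract. The class name QuantletClient, the host, the port 9090 and the newline-terminated text protocol are purely illustrative assumptions and are not part of any XploRe or Yxilon specification.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

/*
 * Minimal, hypothetical sketch of a thin statistical front-end: the user
 * interface only forwards a high-level command to a remote computing engine
 * and prints the textual reply. Host, port and the newline-terminated
 * protocol are assumptions for illustration, not a Yxilon/XploRe API.
 */
public class QuantletClient {

    public static void main(String[] args) throws Exception {
        String host = "localhost"; // assumed location of the computing engine
        int port = 9090;           // assumed engine port

        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {

            // send one high-level command; parsing and computation
            // are entirely the engine's responsibility
            out.println("summarize(x)");

            // print whatever the engine returns; the engine is assumed
            // to close the connection when the result is complete
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

Because such a front-end contains no statistical logic of its own, the same engine could just as well be driven from a web client, an Excel add-in or an interactive e-book example; this separation is the basis of the vertical integration discussed in the next section.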
3 Why do we need vertical integration?

Vertical integration means the smooth integration of statistical computing frameworks into completely different environments such as web browsers, electronic or printed books and standard application software. Table 2 depicts three examples of how this vertical integration may look in practice: In scenario 1, Microsoft Excel reads data from a file or database and hands them over to an embedded computing module; for the graphical presentation of the results, internal Excel routines are used. Scenario 2 shows