
Towards a sustainable geoprocessing environment at Statistics Netherlands through performance benchmarking

Sandra Desabandu

June 2015

Professor: prof. dr. ir. P.J.M. van Oosterom (TU Delft)
Supervisor GIMA: ir. E. Verbree (TU Delft)
Supervisor Statistics Netherlands: ir. Pieter Bresters


Towards a sustainable geoprocessing environment at Statistics Netherlands through performance benchmarking

Research into the applicability of performance benchmarking within proprietary off-the-shelf GIS of Esri, and lessons learned from similar organizations, for decision making on a modern, sustainable geo-ict infrastructure.

Date: 01.06.2015

Author: Sandra Desabandu

Student number: 3686841 (University of Utrecht)

E-mail: [email protected]

Supervisor GIMA: ir. E. Verbree

E-mail: [email protected]

Supervisor Statistics Netherlands: Pieter Bresters

E-mail: [email protected]

Professor: Prof. dr. ir. P.J.M. van Oosterom

E-mail: [email protected]


Preface

Computer performance is a common problem for anyone working regularly with computers, often accepted as a daily annoyance. For anyone working with Geographic Information Systems, performance can be a significant bottleneck that slows down analysis and visualization of geographic data. For an organization such as Statistics Netherlands, which performs complicated spatial calculations on datasets of (in many cases) 8.000.000 records, satisfactory performance is almost a business-critical requirement. This has made the set-up and execution of this research project especially challenging, because scientific research within such an important real-life process of a large public agency requires one to step back at first, to find one's way within related research and to set up a research strategy. At the same time, it is equally important to use the knowledge of the very motivated, experienced and skilled staff members of Statistics Netherlands. In this environment, a performance benchmark had to be devised within the constraints of a real-life organization, and its results had to lead to an advice on the future geo-ict infrastructure. Therefore, the research had to be performed from different angles: technical performance factors to be used for the benchmark, but also organizational factors that could influence the advice. Experiences from other organizations have therefore been an important part of the thesis.

I would like to thank Statistics Netherlands for providing the opportunity to conduct my thesis research, and Pieter Bresters, my supervisor from Statistics Netherlands, for his valuable support and advice and for introducing me so well to the Spatial Statistics department. I am also very grateful for all the useful input and ideas from the staff members of Spatial Statistics, and to the head of the department, Fokke, for taking such good care of the organization of the internship.

GIMA has been a great place to learn about the vast field of Geographic Information and about scientific research. I owe many thanks to my supervisor Edward Verbree and my supervising professor Peter van Oosterom for their constructive feedback.

Also, I would like to acknowledge the role of Esri: without the support, input and ArcGIS Pro testing facilities from Esri, this research would not have been possible. I want to thank Ernst Eijkelenboom especially for all his support. The draft version of the thesis has been provided to Esri for review.

Very valuable information on geoprocessing performance experiences has been shared by a number of organizations where geoprocessing of large datasets is an important process and ESRI software is used:

 PBL (Martijn Spoon and Arjan van der Put)
 Statistics Portugal (Bartolomeus Schoenmakers)
 Statistics Italy (Rossella Molinaro)
 United States Geological Survey (Curtis Price)

Additionally, I would also like to acknowledge the contribution of Aris (Eddy Scheper and Anke Keuren) as an organization that provided insight into migration from ArcInfo Workstation to ArcGIS Desktop.

Conducting such a research project is interesting and fun, but also very demanding, especially in combination with "regular" work and family. Without the support of my family, especially my husband, children, parents and parents-in-law, this task would have been impossible. My husband Marijn Zuurmond stands out especially, foremost for his patience and loving support.

Sandra Desabandu


Abstract

The research project has aimed to support decision making regarding a possible transition towards an up-to-date, sustainable geo-ict architecture, based on an analysis of technical factors that influence the performance of geoprocessing tools at the spatial team of Statistics Netherlands and on experiences at comparable organizations. The current geoprocessing environment is ArcInfo Workstation and ArcGIS Desktop 10.1, although most production processes are conducted in ArcInfo Workstation and its scripting language AML. Support of ArcInfo Workstation will not be continued after 2015. Therefore, migration of these processes to ArcGIS Desktop 10.1 or a higher version of this product suite has to be considered. Consequently, the following research question has been formulated:

Which alternatives to the current geo-ict infrastructure can be proposed for Statistics Netherlands that meet performance requirements of its geoprocessing activities and are suitable for implementation within the organizational constraints of Statistics Netherlands?

Based on studying the workload, resources and performance bottlenecks at Statistics Netherlands, a benchmark has been developed evaluating four geoprocessing tools (UNION, INTERSECT, DISSOLVE and NEAR), tested on scalability (using synthetic data), on the impact of optimization factors available in ArcGIS Desktop 10.1, and on the impact of big data hardware and the new software releases ArcGIS 10.2 and ArcGIS Pro (using real data). A number of administrative processing tools (SUMMARY STATISTICS, FREQUENCY, CALCULATE and JOIN FIELD) have additionally been included in the tests with the big data hardware and the new software releases. The results of the benchmark have been combined with migration experiences (from ArcInfo to ArcGIS Desktop or ArcSDE) of other organizations and held against possible internal and external restrictions and trends. These organizations are Statistics Portugal, Statistics Italy, the United States Geological Survey and PBL (Netherlands Environmental Assessment Agency).

The benchmark showed different results per tool: whereas UNION and INTERSECT show the same performance in ArcGIS Desktop 10.1 as in ArcInfo Workstation, the DISSOLVE and the JOIN are considerably slower in all higher desktop versions. The selected geoprocessing tools UNION, INTERSECT, DISSOLVE and NEAR did not show improvement with the use of the available optimization factors (spatial index, spatial sort, parallel processing environment, compacting and compressing) in ArcGIS 10.1, although the use of optimization factors has not been exhaustive. The most remarkable results showed a decline in performance, for example with compression of the input datasets. The NEAR shows no difference between ArcInfo Workstation and ArcGIS 10.1 on the fat client, but showed a big improvement in ArcGIS 10.2 and ArcGIS Pro on the big data computer. The improvement of the NEAR implies that the algorithm of the tool has been redesigned. The other tools, as well as a number of administrative processing tools tested for Statistics Netherlands, did not show substantial improvement within a big data hardware environment or within ArcGIS Pro. The lessons learned that have been obtained through the interviews and questionnaires can be formulated in several points:

1. The transition has been a gradual process for the respondent organizations, but for Statistics Netherlands the urgency to migrate is higher, because of the high number of production processes written in AML that need to be migrated and because of the product life cycle of ArcInfo Workstation.
2. The influence on ICT infrastructure decision making, in terms of facilities for geoprocessing, will depend on the user base and the organization of the GIS users.
3. The PBL shows interesting options in optimization, but also has a more flexible (geo) ICT infrastructure.

4

4. Open source or other software is largely new territory and has not really been investigated with respect to functionality and performance.
5. For most of the statistical agencies and the PBL, some part of the migration has been outsourced, at least in the initial phase of the migration.
6. The evaluation of the migration to a higher ArcGIS version is mixed and varies per tool or model. It is also dependent on the user needs of the organization.

To provide a migration advice that also takes future developments into account, a number of trends in geoprocessing technology have been described, such as cloud computing and MapReduce/Hadoop. Applying these technologies would require a substantial investment in knowledge and adaptation of production processes. Moreover, solutions that affect the ICT infrastructure as a whole (like cloud computing) or that could affect the results of the production of statistics (by using algorithms like MapReduce) are less likely to be implemented in the near future. Based on the previous information, the following conclusion can be stated: given the limited scalability and the end of support of ArcInfo Workstation, migration to a more up-to-date infrastructure is unavoidable. Three alternatives are possible:

 Alternative 1: Keep a small-scale infrastructure within the ArcGIS Desktop suite and the future ArcGIS Pro. Spatial Statistics stays within the ArcGIS Desktop suite, but should consider a quicker migration to 10.2 or rather 10.3, because more performance problems are resolved within these versions (according to ESRI) and because the possibilities to optimize ArcGIS Desktop 10.1 remain very limited. This will mean that Spatial Statistics will need less in-house expertise on performance optimization, but will remain dependent on ESRI product development processes. This dependency could be countered by cooperating more closely with other ArcGIS users nationally and internationally to press for the needed improvements.
 Alternative 2: Invest in a sustainable (geo) ict environment with other departments. Being more in control of performance optimization means that a more configurable geo ICT infrastructure is needed. Spatial Statistics is too small to make such an investment in infrastructure and the needed expertise ("humanware"). Therefore it should cooperate with other departments at SN that have a need for spatial data analysis. Such a (geo) ict environment could include several components: a spatial database such as Oracle combined with ArcSDE, an open source spatial database such as PostGIS, or a file-based environment. Additionally, extension of the benchmark towards spatial databases would be needed. This need is especially apparent if the volume of data is expected to grow in the future.
 Alternative 3: Organize innovative projects with internal and external partners as a complementary step. In addition to the more short-term solution, Spatial Statistics should start a number of projects dedicated to the application of new geo-ict technologies, for example a project involving Spatial Statistics and other teams in cooperation with the innovation laboratory.

The conclusions have also resulted in a number of recommendations that are aimed at improving the benchmark and at helping the migration process of Statistics Netherlands.

5

Acronyms

AML Arc Macro Language, scripting language of ArcInfo Workstation

ArcGIS Desktop Desktop GIS product suite of ESRI, including ArcMap and ArcCatalog

ArcSDE Spatial Database Engine used as a component of ArcGIS Server to manage data in a spatial database.

BAG Basisregistratie Adressen en Gebouwen (Key Register Addresses and Buildings)

BRT Basisregistratie Topografie (Key Register Topography)

CIFS Common Internet File System (CIFS) is a protocol by which programs make requests for files and services on remote computers on the Internet. CIFS uses the client/server programming model: a client program makes a request of a server program (usually on another computer) for access to a file or to pass a message to a program that runs on the server computer. The server takes the requested action and returns a response (Rouse, Margaret, 2005).

CPU Central Processing Unit

CUDA Compute Unified Device Architecture

GPU Graphics Processing Unit

LAN Local Area Network

OGC Open Geospatial Consortium: organization that defines open standards for geographic data and geographic applications

SN Statistics Netherlands

SCCM System Center Configuration Manager

VBA Visual Basic for Applications

WAN Wide Area Network



List of figures

Figure 1: Organization Chart SN (SN, 2014) ...... 17
Figure 2: Research steps ...... 20
Figure 3: Evolution highlights of ArcGIS software from 1982 to present (Peters, 2014) ...... 22
Figure 4: Coverage data structure (ESRI, 2010c) ...... 24
Figure 5: Arc Node topology (ESRI, 2010c) ...... 25
Figure 6: Area definition by arc list (ESRI, 2010c) ...... 25
Figure 7: Contiguity in coverages (ESRI, 2010c) ...... 26
Figure 8: Performance Testing process (Practical Performance Analyst, 2014a) ...... 29
Figure 9: Memory Hierarchy (Teachbook, 2012) ...... 33
Figure 10: Example of combination of processing tools in script (Zuurmond, 2013) ...... 39
Figure 11: INTERSECT input and output features (ESRI, 2013e) ...... 41
Figure 12: UNION input and output features (ESRI, 2013i) ...... 41
Figure 13: DISSOLVE input and output features (ESRI, 2013c) ...... 42
Figure 14: Different NEAR options (ESRI, 2013f) ...... 42
Figure 15: ArcGIS elements within IT infrastructure SN ...... 45
Figure 16: Overview Statistics Netherlands location of offices and data centres (Stormen, 2013) ...... 46
Figure 17: Overview Statistics Netherlands network entities (Stormen, 2013) ...... 47
Figure 18: Hardware resources (Stormen, 2013) ...... 48
Figure 19: Impact and cost of performance improvement implementation (Godfrind, 2008) ...... 54
Figure 20: Grid index (ESRI, 2014c) ...... 59
Figure 21: Tiled overlay processing (Pardy & Hartling, 2013) ...... 60
Figure 22: Row prime - right (Oosterom, 1999) ...... 60
Figure 23: Hilbert Curve - middle (Oosterom, 1999) ...... 60
Figure 24: Peano Curve - left (Oosterom, 1999) ...... 60
Figure 25: Performance factors ...... 65
Figure 26: Example synthetic data - input datasets UNION and INTERSECT ...... 67
Figure 27: Result (execution time) ArcInfo Workstation baseline test on a selection of geoprocessing and administrative processing tools ...... 79
Figure 28: UNION, INTERSECT, DISSOLVE and NEAR execution and CPU time with ArcGIS 10.1 on fat clients Windows 7, default settings ...... 79
Figure 29: INTERSECT synthetic data using a constant input dataset of 150.000.100 records and a second input dataset with sizes from 100.000 to 50.000.000 records ...... 81
Figure 30: CPU system and user time ratio INTERSECT and UNION - ArcGIS 10.1 Windows 7 fat client ...... 81
Figure 31: Results feature class up to 10.000.000 records ...... 82
Figure 32: Results feature class up to 50.000.000 records ...... 82
Figure 33: Comparison DISSOLVE synthetic data in ArcInfo Workstation and ArcGIS Desktop 10.1 for 100.000, 500.000 and 1.000.000 records ...... 83
Figure 34: NEAR synthetic - linear curve ...... 83
Figure 35: NEAR synthetic data - execution and CPU time ...... 84
Figure 36: Execution and CPU time NEAR outside and during office hours and script local outside office hours (ArcGIS 10.1 fat client) ...... 85
Figure 37: Impact of same or different workspace (fgdb) for in- and output data ...... 85


Figure 38: Impact of sorting (Peano with and without combination of field sorting) on DISSOLVE (real data) ...... 86
Figure 39: Impact of sorting (Peano) on NEAR with options of sorting one or both input datasets ...... 87
Figure 40: Impact of spatial index on NEAR in ArcGIS 10.1 ...... 88
Figure 41: Impact of adapting parallel processing environment setting with DISSOLVE and NEAR ...... 89
Figure 42: Impact of compacting and compressing on DISSOLVE (Win 7 fat client) ...... 90
Figure 43: Influence of compacting and compressing on the UNION (Win 7 fat client) ...... 90
Figure 44: Impact of compacting on NEAR ...... 91
Figure 45: Comparison of fat client (ArcGIS 10.1/Windows 7) and big data computer (ArcGIS 10.2/Windows 7) performance ...... 92
Figure 46: Execution time of geoprocessing tools in the configurations ArcInfo Workstation/fat client, ArcGIS 10.1/fat client, ArcGIS 10.2/big data and ArcGIS Pro/big data. Real datasets are used of different sizes. ...... 93
Figure 47: Measurement values execution time DISSOLVE - synthetic data - ArcGIS 10.1 - Windows 7 - fat client ...... 94
Figure 48: Performance measurement values in ArcGIS Pro - Windows 7 big data computer ...... 94
Figure 49: MapReduce principle (Janakiram, 2012) ...... 103
Figure 50: Grid-based statistics ...... 115


List of tables

Table 1: Coverage feature classes ...... 24
Table 2: Overview of GIS applications at Statistics Netherlands ...... 37
Table 3: Storage Pools and their function ...... 48
Table 4: Properties Windows XP fat client ...... 49
Table 5: Properties Windows 7 fat client ...... 50
Table 6: Specification big data computer ...... 52
Table 7: Specifications ArcGIS, ArcInfo Workstation and ArcGIS Pro ...... 55
Table 8: Workload scenario geoprocessing ...... 68
Table 9: Workload scenario administrative processing ...... 69
Table 10: Required Python libraries ...... 69
Table 11: Nomenclature logfiles ...... 70
Table 12: Indication performance requirements geoprocessing tools ...... 70
Table 13: Indication performance requirements administrative processing ...... 71
Table 14: Scenario network ...... 72
Table 15: Sorting workload scenario ...... 72
Table 16: Scenario spatial index ...... 73
Table 17: Scenario parallel processing ...... 74
Table 18: Datasets compacting and compressing ...... 75
Table 19: Properties hardware ...... 76
Table 20: Execution time ArcInfo Workstation versus ArcGIS 10.1 default real workload ...... 80
Table 21: Administrative processing ArcInfo Workstation vs ArcGIS Desktop 10.1 default settings ...... 80
Table 22: CPU percentages - after adaptation of parallel processing environment setting ...... 89
Table 23: Results number of records for compression - DISSOLVE ...... 95
Table 24: INTERSECT default - differences in number of records and execution ...... 95
Table 25: GIS users of similar organizations ...... 122
Table 26: Geoprocessing and performance optimization at similar organizations ...... 123
Table 27: ICT infrastructure similar organizations ...... 124
Table 28: Migration experiences from similar organizations ...... 125



CONTENT

Preface ...... 3
Abstract ...... 4
Acronyms ...... 6
List of figures ...... 8
List of tables ...... 10
Chapter 1: Introduction ...... 16
1.1 Motivation ...... 16
1.2 Research objective ...... 19
1.3 Research questions and expected results ...... 19
1.4 Scope ...... 21
Chapter 2: Background Information of ESRI software and performance benchmark theory ...... 22
2.1 ESRI software development ...... 22
2.2 ESRI data structures ...... 24
2.3 Definition of performance and performance elements ...... 26
2.4 Types of performance benchmarking ...... 28
2.5 Performance benchmark development ...... 28
2.6 Performance workload ...... 30
2.7 Performance resources ...... 31
2.8 Metrics ...... 34
Chapter 3: Performance Bottlenecks at Statistics Netherlands ...... 36
3.1 GIS applications ...... 36
3.2 Workload Analysis ...... 38
3.3 Performance resources ...... 44
3.4 Summary: performance bottlenecks ...... 52
Chapter 4: Performance Factors ...... 54
4.1 Amdahl's Law ...... 54
4.2 Resource configuration ...... 55
4.3 Dataset size, scalability ...... 56
4.4 Parallel processing ...... 56
4.5 Compression and compaction ...... 57
4.6 Spatial Access Methods ...... 58
4.7 Data format ...... 61
4.8 Application Design ...... 61


4.9 Network ...... 62
4.10 Use of Workspace ...... 62
4.11 Discussion ...... 62
Chapter 5: Benchmark Development for Statistics Netherlands ...... 64
5.1 Method ...... 64
5.2 Performance factors ...... 64
5.3 Analysis and Validation of the results ...... 65
5.4 Setting up the test bed ...... 66
5.5 Defining the performance level ...... 70
5.6 Scenarios to exclude noise from the network and workspace ...... 71
5.7 ArcGIS 10.1 optimization scenarios ...... 72
5.8 Resource upgrade scenarios ...... 75
Chapter 6: Results and Conclusions of the Benchmark ...... 78
6.1 Results Baseline tests ...... 78
6.2 Results Scalability and data size ...... 80
6.3 Exclusion of noise due to network interference or location of input and output data in workspace ...... 84
6.4 Sorting and indexing ...... 86
6.5 Compression and compaction ...... 89
6.6 Hardware configuration and ArcGIS Pro ...... 91
6.7 Validation ...... 93
6.8 Conclusions benchmark ...... 96
Chapter 7: Lessons Learned From Similar Organizations ...... 98
7.1 GIS users ...... 98
7.2 Geoprocessing and performance optimization ...... 99
7.3 ICT Infrastructure ...... 99
7.4 Migration ...... 99
7.5 Lessons learned ...... 100
Chapter 8: Trends in geoprocessing ...... 102
8.1 New forms of data collection and data formats ...... 102
8.2 Cloud services ...... 102
8.3 Hadoop and MapReduce ...... 103
8.4 GPU enabled processing: CUDA ...... 103
8.5 Summary: Consequences for Statistics Netherlands ...... 104


Chapter 9: Discussion, conclusions and recommendations ...... 106
9.1 Discussion and conclusions ...... 106
9.2 Recommendations ...... 110
Appendix 1: Description of spatial products Statistics Netherlands ...... 114
Appendix 2: Description of selection of spatial datasets ...... 116
Appendix 3: Nomenclature logfiles ...... 117
Appendix 4: Questionnaire geoprocessing and migration in Geo ICT Infrastructure ...... 120
Appendix 5: Results interviews and questionnaires ...... 122
Appendix 6: Error Report ESRI ...... 126
REFERENCES ...... 127



Chapter 1: Introduction

This chapter describes the research motivation in section 1.1 and the resulting research objective in section 1.2. From the research objective, the main research question and sub questions are deduced in section 1.3, which also provides an overview of the expected results. Section 1.4 defines the scope of the research.

1.1 Motivation

Spatial calculations are often very time consuming because of the complexity of spatial data models and the large size of datasets. Some examples are overlay calculations of a land use polygon dataset with a road dataset, or a DISSOLVE of a buildings dataset. Therefore, the performance of these calculations is a critical success factor for geoprocessing of large spatial datasets. For organizations that deal with large (spatial) datasets, performance plays an important role in decision making regarding the (geo-)information infrastructure. It is important to base such a decision on valid information regarding the performance of certain architecture scenarios, because systems migration is a complicated, long process with a high chance of failure if the new architecture does not meet the business-critical needs of an organization. This is also the reason for many organizations to hold on to older legacy systems that are no longer supported and developed further, but that still provide mission-critical services (Al-Azzoni et al., 2011).

Governmental agencies are important organizations that create, process and store heavy datasets. However, with the growth of spatial and spatio-temporal data, different data formats (for example Lidar) and the higher use and production of spatial data in everyday life, more organizations that process spatial data regard performance as an important or even business-critical aspect of their (geo) ICT infrastructure. SN is a so-called ZBO (Zelfstandig Bestuursorgaan), a governmental service that operates without a hierarchical relationship to a minister. However, the minister of Economic Affairs is responsible for the financial means of SN. An independent commission, the Central Commission for Statistics Netherlands, defines the statistical research agenda for the following period and monitors the quality and reliability of the research activities (Statistics Netherlands, 2015a).

The main task of Statistics Netherlands (abbreviated SN) is the production and publication of reliable, coherent statistical information, addressing the information needs of Dutch society. Additionally, it is also tasked with cooperation on European statistical data. To be able to conduct statistical research, many data have to be collected about citizens, companies and organizations. Part of these (micro)data is very privacy sensitive. A special Statistics Netherlands act has been installed that provides strict regulations on how to deal with privacy-sensitive data (Statistics Netherlands, 2015b).

Approximately 2000 staff members are working at Statistics Netherlands, divided between two locations: The Hague and Heerlen. The organization is hierarchically subdivided into divisions, sectors and teams. Figure 1 shows an organization chart of Statistics Netherlands. The division Socio-economic and Spatial Statistics (marked in red font) incorporates 7 sectors, of which the sector Environmental, Energy and Spatial Statistics houses a number of teams: Environment, Energy, Real Estate and Spatial Statistics.


Figure 1: Organization Chart SN (SN, 2014)


Spatial Statistics is a relatively small team, compared to the scale of the organization. Four staff members have the role of project manager for the following production processes: the standard products, data management, web services, Eurostat projects (e.g. INSPIRE) and custom products. Some of the statistical researchers have a GIS background and conduct GIS analysis and programming, while other researchers are tasked with European policy on boundaries and with the analysis of temporal changes of addresses within the Netherlands. Three staff members are working as operators; their main task is data entry and imagery interpretation.

The department "Regio en Ruimte" (Spatial Statistics) produces its own statistics, but also supports other (light) users throughout the organization with the use of GIS tooling. These users are especially present in the same sector as Spatial Statistics (Environmental, Energy and Spatial Statistics), but also in the sectors Labour, Income and Quality of Life Statistics, Traffic and Transport Statistics, Quaternary Sector Statistics, Statistical Services and Information, and Business Statistics (Goedhuys, 2014). More on the use of the diverse GIS applications will follow in section 3.1. Spatial Statistics is responsible for the production of spatial statistical products. Together with the regular products, a number of custom products are delivered. The most important regular products of Spatial Statistics are (see Appendix 1 for a detailed description):

 Bestand Bodemgebruik (BBG – Land use map for the Netherlands)
 Financiële Verhoudingswet (analysis of spatial features per municipality, used as a basis for the amount of subsidy provided by the Ministry of Home Affairs)
 Nabijheidstatistieken (Proximity Statistics)
 Wijk- en Buurtkaart (Municipality-, District- and Neighbourhood statistics)
 Vierkantstatistieken (Grid-based statistics)
 Bevolkingskernen (Urban agglomeration)
 Geoservices

These products are widely used by internal and external customers: the magazine Elsevier uses the proximity statistics regarding public transport, healthcare and recreation facilities, in combination with other data, to determine the "best municipality" of the year (Elsevier, 2012). This information is also used by real estate agents. The "Financiële Verhoudingswet" calculations are crucial for subsidies from the national government to municipalities. A number of products are available via PDOK1, a portal that disseminates public geodata via WMS or WFS: Land Use, Grid Statistics, Urban Agglomeration and Municipality-/District-/Neighbourhood Statistics. The widespread use of these products and the legal background of the "Financiële Verhoudingswet" mean that a high quality standard has to be maintained, as well as the timeliness of the results.

Most of the products are the result of very time-consuming analysis and processing of large spatial (micro)datasets. Among the most used datasets are the (spatial) Dutch key register datasets, such as the BAG (key register of addresses and buildings), the GBA (key register of residents), the BRT (topographical key register) or the NHR (key register of the chamber of commerce), complemented by imagery of the Netherlands and location data of several sectors within the Netherlands. An administrative join between the personal records database (GBA) and the key register of buildings and addresses (BAG), in combination with spatial overlays between layouts of territories, often takes more than 24 hours on the standard fat clients at Statistics Netherlands.

Most products are based on the use of different geoprocessing tools in ArcInfo Workstation, written in Arc Macro Language (AML). ArcInfo uses old file formats like coverage, INFO and shapefile. ArcGIS Desktop 10.1 is also part of the production process, but the staff members of Spatial Statistics have experienced better performance of geoprocessing tools in Workstation and AML, although no further research has been conducted on the reasons and on possible methods to improve performance in ArcGIS Desktop 10.1. Migration to ArcGIS Desktop might be necessary for various reasons:

1 Public Services on the map

 Knowledge of AML and Workstation is becoming very scarce; therefore, maintenance of the tools will become problematic.
 Support of the software producer ESRI for Workstation will stop after the introduction of Windows 8 and ultimately after 31.12.2015 (ESRI, 2014a).
 ArcInfo Workstation also shows limitations in loading current key register datasets.

These reasons show that the current situation may provide better geoprocessing performance, but may not be sustainable for long-term future production processes. Therefore, the developers within the spatial team have already started to redesign a part of the production processes for ArcGIS 10.1 (the process for the land use dataset).

The redesigned process uses a different method to detect changes in land use and to edit the data. For this reason, it is not possible to compare the performance of the complete process between the old and the new situation: different geoprocessing steps and input data are used in the two methods. It is, however, possible to examine performance at the level of the geoprocessing tool, using different factors that contribute to performance and a selection of input datasets of different population sizes. The spatial team has already identified a number of viable factors that have influence on performance, for example data format, population length and width (number of records and number of attributes), time/day of tool execution and software. More factors will be described in detail in chapter 4. The number of performance factors is already high, and more important factors are expected to be added as a result of literature research and interviews. Therefore, prioritizing the factors will be equally important. A minimal sketch of such a per-tool measurement is given below.
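To illustrate what measuring at the level of a single geoprocessing tool could look like, the sketch below times one tool run and records both execution (wall-clock) time and CPU time, the two metrics reported later in the benchmark results. It is a minimal sketch assuming an ArcGIS 10.1 installation with arcpy; the workspace path and dataset names are hypothetical placeholders, not the actual test bed.

import os
import time
import arcpy

# Minimal sketch: timing one geoprocessing tool run (ArcGIS 10.1, Python 2.7).
# The workspace and dataset names below are hypothetical placeholders.
arcpy.env.overwriteOutput = True
arcpy.env.workspace = r"C:\benchmark\testdata.gdb"

def timed_run(tool, *args):
    """Run one tool; return execution (wall-clock) and CPU (user + system) time."""
    wall_start = time.time()
    cpu_start = os.times()
    tool(*args)
    cpu_end = os.times()
    cpu = (cpu_end[0] - cpu_start[0]) + (cpu_end[1] - cpu_start[1])
    return time.time() - wall_start, cpu

# Example: INTERSECT of two hypothetical input feature classes.
wall, cpu = timed_run(arcpy.Intersect_analysis,
                      ["landuse_100k", "roads_100k"],
                      "intersect_result")
print("execution time: %.1f s, CPU time: %.1f s" % (wall, cpu))

Repeating such a run per tool, per input size and per optimization factor yields the measurement series that the benchmark charts are built from.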

To be able to take a well-founded decision about a possible migration from Workstation to an up-to-date, sustainable geo-ict architecture, for example ArcGIS Desktop 10.1 or higher without AML, Statistics Netherlands wants to assess whether such a migration will affect the performance of analysis processes and which measures would be advisable to keep an acceptable level of performance. This implies performance evaluation of geoprocessing steps in combination with different performance factors/variables, but also using the knowledge and best practices of other organizations that have dealt with migration from a legacy spatial DBMS like ArcInfo Workstation towards a more up-to-date and sustainable geo-ict architecture. Additionally, Statistics Netherlands has to look for a viable solution if the new ArcGIS version does not fulfil its performance needs. The software used for geoprocessing is, of course, an important factor that can influence performance. Software solutions for big data, like Hadoop, and statistical software have developed spatial extensions and have been tested in diverse studies, for example in Aji et al. (2013). While the results of the performance evaluation may point towards a certain solution for Statistics Netherlands, the "real life" setting of the research at Statistics Netherlands also calls for the implementability of such a solution. Therefore, the organizational constraints have to be considered.

1.2 Research objective

The aim of this research project is to support decision making regarding a possible transition towards an up-to-date, sustainable geo-ict architecture, based on an analysis of technical factors that influence the performance of geoprocessing tools at the spatial team of Statistics Netherlands and on experiences at comparable organizations. Comparable organizations are governmental agencies that process large spatial datasets, such as the PBL (Planbureau voor de Leefomgeving) and statistical organizations in other countries, semi-public organizations such as TNO, but also commercial organizations that make value-added products based on large datasets. The results of the research should be evaluated against the backdrop of organizational constraints (for example business-critical processes, organization of ICT), commercial decision making of software vendors and especially the level of performance that is acceptable for Statistics Netherlands, and should lead to a viable scenario for the geo-ict architecture of the organization.

1.3 Research questions and expected results

Based on the motivation and research objective, the central research question has been formulated as follows:


Which alternatives to the current geo-ict infrastructure can be proposed for Statistics Netherlands that meet performance requirements of its geoprocessing activities and are suitable for implementation within the organizational constraints of Statistics Netherlands?

To be able to answer the central question, the following sub questions have to be answered first:

1. Which technical as well as organizational performance bottlenecks can be identified during geo-information processes at Statistics Netherlands, and how are they currently dealt with?
2. Which (technical) performance factors can be identified for geoprocessing, based on literature examples and on experiences at Statistics Netherlands (a number of factors have been identified by SN), and how can these factors be prioritized and evaluated with the design of a benchmark, based on the performance requirements of Spatial Statistics?
3. What are the results and conclusions of the performance evaluation tests?
4. What are geo-ict related trends (e.g. the role of performance in following releases of ArcGIS or other proprietary or open source software) that could influence the recommended migration scenario for Statistics Netherlands?
5. What are lessons learned from organizations with a profile similar to Statistics Netherlands that process large spatial datasets and can provide information on performance evaluation and its role within decision making related to geo-ict infrastructure?
6. Which (geo) ICT scenario is most suitable for the organization described in the case study, including implementation steps and quick wins?

These research sub questions can be translated into the schema shown in Figure 2, where the violet block "recommendations" is the final result, based on the other blocks:

Figure 2: Research steps


 Blue: Sub question 1 (Chapter 3): A global description of Statistics Netherlands as an organization will be provided, followed by the position of Spatial Statistics as well as its products, application landscape, workload and resources. Based on this information, organizational as well as technical performance bottlenecks can be analysed.
 Green: Sub questions 2 and 3 (Chapters 2, 4, 5 and 6): A performance evaluation method/benchmark applicable for measuring the influence of certain technical factors on a selection of geoprocessing tools, and the testing results, visualized in performance charts per performance factor and geoprocessing tool and interpreted in relation to the testing process.
 Orange: Sub question 4 (Chapter 8): Geo-ict trends.
 Red: Sub question 5 (Chapter 7): An analysis of lessons learned/best practices of organizations with a profile similar to Statistics Netherlands.
 Violet: Sub question 6 (Chapter 9): Conclusions and recommendations for a geo-ict scenario, based on the research results, including quick wins for the spatial team of Statistics Netherlands.

1.4 Scope

The benchmark will focus on the technical factors (for example: software, indexing, data size, parallel processing and hardware) that can affect the performance of geoprocessing tools. The organizational bottlenecks should be used to create a realistic migration scenario for Statistics Netherlands.

A number of performance factors have already been stated by Statistics Netherlands, and even more factors could be derived from scientific literature. Using too many factors could introduce bias into the testing results; therefore, prioritizing the factors and weighing them in importance will be vital to achieve clear results and to finalize the research within the provided time. The factors that can be tested at Statistics Netherlands within ArcInfo Workstation and ArcGIS 10.1 also have priority.

The benchmark will not include ArcGIS in combination with spatial databases such as Oracle or PostgreSQL nor will it cover other proprietary or open source GIS software. It will also limit itself to the data formats coverage and file geodatabase.


Chapter 2: Background Information of ESRI software and performance benchmark theory

To understand the implications of software migration from ArcInfo Workstation to ArcGIS Desktop, a clear picture of the involved applications and their development history is important. This will be provided in section 2.1. Section 2.2 deals with the ESRI data structures. Developing a benchmark calls for a methodological foundation on the meaning of performance, performance metrics, benchmark development, and workload and resources analysis. This will be covered in sections 2.3 through 2.8.

2.1 ESRI software development

To place the applications that are used at Statistics Netherlands within their historical context, a short time line of the product development of ESRI will be described. Figure 3 shows the development of the ESRI software from the early 1980s through 2014, from a script-based via an object-based towards a service- and cloud-based environment. Growth in hardware performance has stimulated this development: during the transition from scripts towards objects, the development in performance mainly depended on faster hardware, whereas higher network performance played an important role in the transition from an object-based towards a service-oriented environment and from a service-oriented architecture towards cloud computing.


Figure 3: Evolution highlights of ArcGIS software from 1982 to present (Peters, 2014)

The development of ArcInfo has been based on the "toolkit principle", meaning that geographic features are defined as objects, with geoprocessing operators to perform GIS analysis on these objects. This is in contrast to the spatial database approach (Morehouse, 1989). The application has been designed in modules for data structure, device interface, processing and program. Each module has been designed independently of the others to make the application more flexible and expandable. ArcInfo Workstation has a command-line interface and its tools are scripted in AML.

The data model has been developed to be compliant with the geoprocessing tools. The most important part of the data model for ArcInfo is the coverage (more information is provided in section 2.2). The programming/scripting language of Workstation is AML, a runtime interpreted language. The last version (version 10) of ArcInfo Workstation has been issued together with ArcGIS Desktop 10.0. There will be no further development, although Workstation can be used in combination with higher desktop versions. The application will be retired as of 31.12.2015 (ESRI, 2014a).


Whereas ArcInfo Workstation has been developed for GIS professionals, the need emerged to develop a desktop application suitable for users who are not geo-information specialists. ArcView has been the first step, developed in the beginning of the 1990s: first as a more user-friendly graphical interface, compatible with Windows, to view geographic data, in contrast to the command-line driven ArcInfo. Gradually, more functionality has been added to enable users to conduct more complex spatial analysis. ArcView still has a number of dedicated users, for reasons of cost (low licence costs), legacy files and scripts, as well as functionality: some editing and overlay functionality seems to be easier and quicker than in ArcGIS Desktop (Morais, 2008).

ArcInfo and ArcView work with file-based datasets, but the need to manage and share spatial data led to the development of ArcStorm and later of ArcSDE, a middleware component that is used in combination with the spatial databases of Oracle, Informix, IBM DB2, SQL Server and Sybase (Guan, 2006). The need to share geo-information without multiple copying and storage of data also led to the development of an environment that facilitated web services: ArcIMS, released around the time of the release of ArcSDE (Peters, 2008).

The software development also changed in the early 1990s, from traditional scripting to object component coding (Peters, 2008). The chosen technology is the Microsoft Component Object Model (COM). In 1999, ArcGIS Desktop 8, developed in COM, was issued (Guan, 2006). It was the start of a new desktop product, combining the ArcView interface with the heavy functionality of ArcInfo Workstation. With ArcGIS Desktop 9.3, a number of new components have been introduced, such as ArcCatalog, an application to manage GIS files, and the toolbox containing a collection of different data management and geoprocessing tools. ArcGIS 10 marks another milestone: the integration of ArcCatalog and ArcMap and the transition from VBA2 to Python as a scripting language and VB.NET or C# as a development language (ESRI, unknown). Around that time, in 2008, service-oriented architecture became more important. It enabled maintenance of data integrity, reduction of storage costs, as well as the use of map services on mobile devices. New releases often provide new functionality or bug fixes, but also the redesign of tools to improve performance: for the release of 10.2, for example, the GENERATE NEAR TABLE and the NEAR tools have been rewritten to be, according to ESRI, "dramatically faster" (ESRI, 2014d). New versions of the software or new service packs for a certain version are used to address bugs and performance problems and to introduce new functionality. In 10.3, a number of performance-related improvements have also been made, compared to 10.2.2 (ESRI, 2015).

The current decade is marked by the development towards cloud computing and the use of ArcGIS as a platform, not only as desktop software, but also as a platform to exchange data. The new desktop GIS ArcGIS Pro will be installed and updated via the platform, which means the end of official versions and service packs. This application is the most recent desktop GIS product of ESRI, aimed at professional use. In January 2015, version 1.0 was released. It is completely redesigned on different levels: interface, workflow, use of 2D and 3D, scripting and use of hardware resources. The user interface is very different from ArcGIS Desktop: it uses a "ribbon interface" that activates the menu tabs or buttons that are needed within a certain process. In ArcGIS Pro, a project is used to keep resources such as maps, layouts and connections to databases in the same place, and to enable cooperation on these resources between different project team members (ESRI, 2015).

Python 3.4 is used in ArcGIS Pro to automate geoprocessing steps; it is important to note that ArcGIS Pro scripting is thus still authored in Python. The current Python version used with ArcGIS 10.1 is Python 2.7. The geoprocessing tools can be executed on the fly, or via the toolbox as a model or a script tool, which can contain a Python script, a batch file or an executable (see the sketch below). ArcGIS Pro does not support coverages and INFO files, nor does it support conversion to and from coverages.
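As an illustration of the script tool route, the sketch below wraps a single DISSOLVE call in a Python script that could be attached to a toolbox script tool. The parameter order and the dissolve field are hypothetical examples, not taken from the production scripts of Statistics Netherlands.

import arcpy

# Minimal sketch of a toolbox script tool wrapping DISSOLVE.
# Parameter order and the dissolve field are hypothetical examples.
in_features = arcpy.GetParameterAsText(0)     # input feature class
out_features = arcpy.GetParameterAsText(1)    # output feature class
dissolve_field = arcpy.GetParameterAsText(2)  # e.g. a municipality code

arcpy.Dissolve_management(in_features, out_features, dissolve_field)
arcpy.AddMessage("Dissolved %s on field %s" % (in_features, dissolve_field))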

2 Microsoft Visual Basic for Applications


ArcGIS Desktop 10.3 still supports conversion to and from coverages. Performance improvement with ArcGIS Pro has been indicated by ESRI for the following reasons (Desabandu & Eijkelenboom, 2014):

 a GPU-driven graphic interface, so the application visualizes spatial data much faster;
 statistical (raster) calculations that have been redesigned (there is no complete list of redesigned tools available; it is not known whether vector tools have been redesigned);
 more efficient memory use as a 64-bit application.

It is possible to run ArcInfo Workstation 10, ArcGIS Desktop 10.1 and ArcGIS Pro concurrently on one system. The ArcGIS Pro release will take place with the release of ArcGIS Desktop 10.3. The new development environment that can be used to extend ArcGIS Pro is the .NET SDK3, following up on ArcObjects. The SDK does not support custom geoprocessing functions (Elkins Jr & Macleod, 2014).

2.2 ESRI data structures

2.2.1 Coverages

The coverage has been introduced by ESRI in the 1980s as an innovative GIS data format containing topology, in contrast to AutoCAD files (Van Dyke, 2009). A coverage dataset is stored on the computer as a directory (workspace) with the coverage file name. The directory contains different feature classes. A number of them are shown in Figure 4, and the meaning of these different types of feature classes is explained in Table 1.

Figure 4: Coverage data structure (ESRI, 2010c)

TIC Bounding box points of feature class extent
Node Intersection point where two or more arcs meet
Label Label points, stored as a separate feature class
Arc A straight line vector that starts and ends at a node
Polygon Polygon feature class

Table 1: Coverage feature classes

Next to the feature classes, the attribute data are stored separately in tables (INFO). When viewing the feature class in ArcCatalog, the attribute data seem to reside in the same table. Specific feature attributes are stored in .adf files. Coverages are based on explicitly stored topology, which means that information on the topological relationships is stored in tables that are related to the represented features (Batcheller, 2007). The topology can be enforced and updated with the ArcInfo commands CLEAN and BUILD (Theobald, 2001). Explicit topology eliminates the necessity to compute the topology on the fly and could therefore reduce the required computing capacity.

There are two main topological principles that are stored within the coverage data structure:

3 Software Development Kit

a) Linear network topology, called “arc-node topology principle” by ESRI, which is the basis for network calculation

This principle stores the connectivity between the line features, which are called arcs within a topological structure. The nodes define the beginning and ending of an arc, therefore providing the arc with a direction. The arc-node list stores this relationship between arcs and nodes in a table. Common node numbers indicate an intersection between arcs. This is shown in the example of Figure 5.

Figure 5: Arc Node topology (ESRI, 2010c)

b) Planar partition: the principle of subdividing a plane into different polygons, defining an area by one or more boundaries. It is applied in two variations in the coverage, shown in Figures 6 and 7: area definition and contiguity.

Area Definition

The difference between the arc-node structure that defines a polygon area and the polygon definition in a shapefile or feature class is the storage: in a coverage, the arc-node structure represents polygons as an ordered list of arcs rather than a closed loop of x,y coordinates. This is called polygon-arc topology. In the illustration below, polygon F is made up of arcs 8, 9, 10, and 7 (the 0 before the 7 indicates that this arc creates an island in the polygon). Each arc appears in two polygons (in the illustration below, arc 6 appears in the list for polygons B and C). Since the polygon is a list of arcs defining its boundary, arc coordinates are stored only once, thereby reducing the amount of data and ensuring that the boundaries of adjacent polygons don't overlap.

Figure 6: Area definition by arc list (ESRI, 2010c)

Contiguity

Contiguity defines the adjacency of polygons by their sharing of the same arc. The direction of the arc, which is also stored, helps to differentiate between left and right adjacency; a small code illustration of both principles follows Figure 7.


Figure 7: Contiguity in coverages (ESRI, 2010c)
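To make these stored relations concrete, the following sketch mimics a coverage arc table in plain Python: each arc records its from and to nodes and the polygons on its left and right, so polygon boundaries (area definition) and neighbouring polygons (contiguity) can be read directly from the table, without any geometric computation. The arc numbers and polygon labels are made up for illustration and do not correspond to the figures.

# Illustrative arc table in the spirit of coverage topology: each arc
# stores its end nodes and the polygons on its left and right side.
# Arc numbers and polygon labels are invented for this example.
arcs = {
    # arc_id: (from_node, to_node, left_polygon, right_polygon)
    5: (3, 4, "A", "B"),
    6: (4, 3, "C", "B"),
    8: (5, 6, "E", "F"),
}

def neighbours(arc_table):
    """Contiguity: two polygons sharing an arc are adjacent."""
    return {frozenset((left, right))
            for (_, _, left, right) in arc_table.values()}

def boundary_arcs(polygon, arc_table):
    """Area definition: all arcs that bound a given polygon."""
    return [arc_id for arc_id, (_, _, left, right) in arc_table.items()
            if polygon in (left, right)]

print(neighbours(arcs))          # pairs {A,B}, {C,B} and {E,F} are adjacent
print(boundary_arcs("B", arcs))  # arcs 5 and 6 bound polygon B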

2.2.2 Feature classes and file geodatabase

With the development of the desktop products, a data format was needed that could be displayed very quickly and be interoperable with other GIS products: the shapefile. This is a non-topological file format that is still widely used because it is interoperable with other GIS packages. It is very seldom used at Statistics Netherlands and will therefore not be explained further. After the development of the shapefile, the need for better data management led to the development of the geodatabase. Within ArcGIS, three types of geodatabase are offered:

 The personal geodatabase: the first ArcGIS geodatabase format, for single-user use; data are stored in Microsoft Access; maximum storage is 2 GB.
 The file geodatabase: a directory of different datasets such as feature classes, feature datasets, raster datasets and attribute tables; the default maximum storage is 1 TB, but this can be extended to a maximum of 256 TB. Configuration keywords can be used as parameters to improve storage and performance of the data (Childs, 2009); see the sketch below.
 The enterprise geodatabase: a multi-user geodatabase stored in a spatial database management system such as Oracle, Informix, DB2 or PostgreSQL; maximum storage is dependent on DBMS limits.
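As a small illustration of the file geodatabase route, the sketch below creates a file geodatabase and a feature class with a configuration keyword. The folder, names and spatial reference are assumptions for illustration, and the availability of the chosen keyword should be checked against the installed release.

import arcpy

# Minimal sketch: create a file geodatabase and apply a configuration
# keyword to a new feature class (folder and names are hypothetical).
gdb = arcpy.CreateFileGDB_management(r"C:\benchmark", "testdata").getOutput(0)

# MAX_FILE_SIZE_256TB raises the default 1 TB dataset size limit.
arcpy.CreateFeatureclass_management(
    gdb, "landuse", "POLYGON",
    spatial_reference=arcpy.SpatialReference(28992),  # RD New, the Dutch grid
    config_keyword="MAX_FILE_SIZE_256TB")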

With the development of the geodatabase, the focus has been on fast retrieval of spatial features and their attributes, while there is still a need for explicit topology for a large number of tools, such as the generalization tools (e.g. DISSOLVE, SIMPLIFY LINE) that depend on identifying topological relations such as neighbouring features. Therefore, Esri has applied methods to build temporary topology for the non-topological data formats (e.g. the file geodatabase), such as a Triangular Irregular Network (TIN) (Lee & Hardy, 2006).

2.3 Definition of performance and performance elements

Performance is a non-functional requirement, like security, availability, accessibility, reliability, stability and scalability. These requirements refer to certain qualities that the application or system has to fulfil, whereas functional requirements refer to functionality. The term "performance" within the area of ICT alone has different definitions and interpretations. It is necessary to agree upon one definition, which makes the development of the benchmark as well as setting up the metrics easier. Moreover, it is up to the GIS users at Statistics Netherlands how they define performance and which performance level they need.

In technical documentation as well as scientific literature, different interpretations can be observed. ESRI, for example, defines performance as "the speed at which a given operation occurs, e.g. request response time measured in seconds" (Pizzi, 2013). This is especially important for the end user of an application: the higher the speed, the better. But within the time metric, it is very user dependent which time value means that the process is performing well: for example, a response time of 1 minute would be qualified positively for a batch operation, but negatively for a web transaction (Godfrind, 2008). This user dependency also shows the need for clear SLAs (service level agreements) between the user and the system manager/IT department. For the user, the quality of the data is another important aspect: if the execution time is fast, but the results vary with each identical run or show faults, the performance is still unsatisfactory. For the system or application manager, on the other hand, other aspects are related to performance: how efficiently are hardware resources, such as CPU and memory, used? Does the process lose time because of queuing or because of inefficient memory use? From a system or application manager's point of view, the cost is equally important (Godfrind, 2008). Scalability is another non-functional requirement and strongly related to performance: maintaining an acceptable performance level while the load (workload) on the system increases.

Surprisingly, Wikipedia (2015) actually provides a definition that is more complete, because it integrates, more or less, both points of view: it states that "computer performance is characterized by the amount of useful work accomplished by a computer system or computer network compared to the time and resources used." Here, time is only one aspect of performance: the percentage of the time used for the actual task counts; a high percentage of the time spent on overhead (for example waiting time between virtual and main memory) means a lower performance. With the expression "useful work", a quality aspect is also introduced: the results of the task have to be of high quality, or at least high enough to be used. The expressions "useful work", "workload" and "resources" describe important aspects of performance. The meaning of workload in literature is covered in section 2.6, and the real workload at Spatial Statistics in chapter 3. The workload makes use of various resources, e.g. hardware, software and network capabilities. The state or attributes of these resources (e.g. clock speed and number of cores of the CPU) are important parameters for the performance benchmark, whereas metrics on the use of these resources are important to include in the benchmark.
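The distinction between elapsed time and time spent on useful computation can be made concrete in a few lines of Python, independently of any GIS software. In the sketch below, a CPU-bound task shows wall-clock time close to CPU time, while a waiting task (standing in for I/O or queuing overhead) burns almost no CPU; both task bodies are artificial examples.

import os
import time

# Contrast wall-clock time with process CPU time (user + system):
# a large gap indicates waiting (I/O, queuing, paging), not useful work.

def busy(n=10**7):
    total = 0
    for i in range(n):      # CPU-bound loop: wall time is close to CPU time
        total += i
    return total

def waiting(seconds=2):
    time.sleep(seconds)     # I/O-like wait: CPU time stays near zero

for task in (busy, waiting):
    wall_start, cpu_start = time.time(), os.times()
    task()
    wall = time.time() - wall_start
    cpu = sum(os.times()[:2]) - sum(cpu_start[:2])
    print("%-8s wall=%.2fs cpu=%.2fs useful=%3.0f%%"
          % (task.__name__, wall, cpu, 100 * cpu / wall))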

Resources can be generally described as "system elements that offer services required by other system elements" (Woodside et al., 2007), for example by the workload. The resources of Statistics Netherlands, especially of Spatial Statistics, will be described in chapter 3. In the case of Spatial Statistics, for example, different services are required in terms of hardware and software resources than for other teams: higher graphical capabilities, higher calculation power and larger memory.

Performance, in all its different facets and definitions, is the goal of performance engineering, a large domain, but one that has for a long time been employed at a later stage than necessary. Woodside et al. (2007) provide a fairly complete review of the current state of performance engineering and its elements. The authors analyse a number of problems within that domain. One of these problems is the late detection of performance problems. They propose the following definition:

Software Performance Engineering (SPE) represents the entire collection of software engineering activities and related analyses used throughout the software development cycle, which are directed to meeting performance requirements.

This definition already covers the complete development cycle, meaning that a large part of performance engineering has already been done by the software developer. How the developer actually designs and carries out the performance engineering is not always clear, which is sometimes the case with commercial developers. Moreover, no role for hardware resources has been depicted in this definition, although performance is also aimed at the use of hardware resources.


2.4 Types of performance benchmarking

Performance benchmarking plays an important role as decision support for the use of a certain system as well as in research and development (Ray et al., 2011). Paul (2008) describes benchmarking as a “process of evaluating a system against some reference to determine the relative performance of the system.” Ray et al. (2011) and Bouckaert et al. (2011) use a similar definition. The reference is therefore a very important element of a benchmark: without a reference system, the system under test (SUT) can be measured, but it is difficult to analyse the measurements. With the growth in computing resources (hardware and software) and in the use of these resources, the need to compare different resources has grown as well, resulting in the development of a large number of different benchmarks. This in turn results in a need to classify and standardize performance benchmarks. Benchmarks can be classified based on their goal (Klinkenberg, 1997): a qualitative benchmark should provide information on whether system functionality is present, whether it lives up to expectations and whether it is easy to use, whereas a quantitative benchmark tries to answer the question whether the system has the necessary capacity to handle the planned workload. The focus of this research will be on the quantitative benchmark. There are quantitative benchmarks on different levels, such as presented by Menasce & Almeida (2001):

 Basic operations: measure only one aspect, such as CPU speed, with the use of a synthetic workload. Basic operations are also called micro benchmarks.
 Toy benchmarks: also measure on a small scale, although with a real-life workload.
 Kernel benchmarks: pieces of code extracted from real programs, focusing more on internal resource usage.
 Real program benchmarks: test full-scale, real programs.

The planned benchmark for geoprocessing tools in ArcGIS fits largely into the category of a micro benchmark, although real datasets, a real IT configuration and real software functionality will be used. However, only one tool at a time will be tested. Hennessy and Patterson (2007) assess the real program benchmarks as the most useful, since small-scale benchmarks like toy or kernel benchmarks can be optimized for one process only and provide a limited view on the system’s performance.

For the development of a performance benchmark for spatial databases or GIS systems, few examples are available, let alone standards, although two organizations, the Standard Performance Evaluation Corporation (SPEC) and the Transaction Processing Performance Council (TPC), develop and provide performance benchmarks for the computer industry. OGC, the Open Geospatial Consortium, is more focussed on conformity and interoperability of spatial data standards than on performance, according to Batcheller et al. (2007). Developing a benchmark for spatial database systems/GI systems is also a challenge because of the complexity of storage and operations, which are not suitable for “straightforward” analysis (Batcheller et al., 2007). A number of publications, like Ray et al. (2011) and Paton et al. (2000), indicate a lack of standard spatial database benchmarks, although the use of spatial data has grown substantially. Therefore, both publications developed a benchmark for spatial vector data.

Esri has developed a tool to visualize, plan and measure the performance of a GIS system, called System Designer, and has also published best practices via different media channels and during user and developer conferences. The impression remains that these are mostly directed at larger GIS infrastructures with several tier levels, containing servers, web services and desktop users (Sakowicz & Pizzi, 2014).

2.5 Performance benchmark development

Because of the lack of standard benchmarks for geographic information systems, a new benchmark has to be developed, tailored to the needs of Spatial Statistics but also suitable for comparable organizations. There are different ways to arrive at a repeatable benchmarking method. Some methods follow a step-by-step, chronological approach, such as shown in Figure 8. The most important step is to start with mapping the business objectives and goals. This is followed by documentation of the non-functional requirements. These are often difficult to define and capture, unlike functional requirements. Derived from the business objectives and processes and a description of the real-life workload, the benchmark workload should be defined and documented, followed by definition and documentation of the testing approach and specification of tooling and environment. Based on testing approach and requirements, building the scripts and preparing the data is the following step, followed by validation tests and small-scale single-user tests. After these tests have been executed successfully, the workload can be sized up. If bottlenecks occur, cooperation with the development team is advised to resolve these issues. The concluding step is the analysis of the test results and the creation of a performance model.

Figure 8: Performance Testing process (Practical Performance Analyst, 2014a)

The approach resembles the waterfall approach in the domain of computer systems design, if indeed all the presented steps have to be followed in chronological order. For the testing environment of SN, a number of steps will not be necessary, will return in reduced form or will appear at a different segment of the schema. The documentation of the non-functional requirements will be restricted to documentation of performance only, because this is the foremost issue for the spatial team regarding the possible migration to ArcGIS 10.1 and no other non-functional requirements have been communicated or documented.

The identification of bottlenecks is only presented as one of the last steps within the process that involves the development team, but should recur throughout the process: contact with the IT department, application management and technical staff from the software vendor should take place at least when defining the testing approach. This could prevent choosing testing parameters that are not viable or meaningful. It should be noted that there is no application manager for ArcGIS at Statistics Netherlands, which means that there is no staff member available who can combine knowledge of the geo-ICT infrastructure and new developments of the software with knowledge of user needs.

When evaluating benchmarks, the following information needs to be provided in order to interpret the results (Menasce & Almeida, 2001): the benchmark environment (systems configuration, workload, method), the representativeness of the benchmark workload and the properties of the system or new system configuration the implementation of which has to be decided upon. Therefore, creating the benchmark also introduces the necessity to document the benchmark components very extensively. Menasce & Almeida (2001) assign the following quality requirements to a successful benchmark:

• Relevance: It must provide meaningful performance measures within a specific problem domain.

• Understandable: The benchmark results should be simple and easy to understand.

• Scalable: The benchmark tests must be applicable to a wide range of systems, in terms of cost, performance and configuration, and must be able to cope with a growing size of the input data.

• Acceptable: The benchmarks should present unbiased results that are recognized by users and vendors.

These requirements are not easy to meet, and acceptability is probably the most difficult: the user group as well as the software developer involved have different stakes within this research project, so the results could be difficult to accept for at least one of these parties, although the users remain dominant in this process.

Whereas Menasce & Almeida (2001) provide certain qualities, Bouckaert et al. (2011) describe specifications that every benchmark should include:

 A benchmark scenario (description of test settings, tools, data).
 Performance evaluation criteria (describe the high-level focus of the benchmark output).
 Performance evaluation metrics (quantitative measure of a specific quality of a SUT).
 Performance evaluation score (metric value).

The authors also add an important quality of a benchmark: configurability, meaning that the test bed should support the needs of the benchmark. In our case, part of the test bed is a real, operational infrastructure (fat clients at Spatial Statistics) and part of it is more configurable (the innovation laboratory), although also dependent on policy, staffing and system requirements.

2.6 Performance workload

The workload can be defined as the demand that is directed at the IT system resources or the database infrastructure: this includes the data and various data parameters, such as the number of datasets, the size of the datasets (in records, attributes, memory) and data structures, but also actions directed at the resources, such as queries, transactions or data editing (Mullins, 2010). A thorough analysis of the workload is necessary to be able to develop a performance benchmark that is representative of the organization’s information processes. Many literature examples view the workload from a technical, resource-related angle. On the other hand, a workload is also seen as a “logical set of activities that needs to be performed by users towards achieving a certain (business or customer) goal” (Practical Performance Analyst, 2014b). With that definition, two types of workload can be defined: the business workload and the infrastructure workload. The business workload encompasses the activities within the IT infrastructure that are directly related to business objectives. For the spatial team, producing spatial statistics for a wide range of customers is an important business objective.

The business workload placed on the GIS infrastructure contains, amongst others, the size of the datasets, the type of the geoprocessing tools, the frequency of the use of certain tools, the number of users and the user types. The infrastructure workload relates to the resource utilization patterns that characterize a business workload. This could be CPU or I/O behaviour, or patterns related to memory use. This is difficult to analyse in our case because of the lack of infrastructure workload data.

It is also possible to differentiate between a real and a synthetic workload, which is already at the level of benchmark development: whereas a real workload applies real-life applications, queries and real data, a synthetic benchmark uses either a simulation of the resource use of the real workload and/or synthetic data. The advantage of synthetic data is that one can control scaling (Simion et al., 2012) and changes in algorithm much better. Moreover, privacy or data ownership issues can be tackled with synthetic data. Irregular data, especially a problem with spatial data, can also obscure the true reason for lower performance, for example an algorithm that is less efficient. However, the fact that synthetic data are not real and mostly very regular in shape and distribution also raises questions about the usability of the results for real-life situations. A number of related work examples, such as Paton et al. (2000) and Simion et al. (2012), use a synthetic workload, whereas Kaler (2012) uses both a synthetic workload and a real workload. The synthetic workload is used to detect the direct influence of the data structure of spatial data on performance. The goal of a synthetic workload in this case is to “measure performance in a controlled environment”, leaving space for the researcher to try different performance parameters.
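To illustrate the controlled scaling that synthetic data offer, a minimal sketch of a synthetic workload generator is given below; the polygon sizes, extents and counts are invented for the example and do not correspond to any SN dataset.

```python
import random

def synthetic_polygons(n, extent=100000.0, size=100.0, seed=42):
    """Generate n square polygons, uniformly distributed over the extent."""
    rng = random.Random(seed)        # fixed seed keeps the workload repeatable
    polygons = []
    for _ in range(n):
        x = rng.uniform(0.0, extent)
        y = rng.uniform(0.0, extent)
        # a square as a closed ring of (x, y) coordinate pairs
        polygons.append([(x, y), (x + size, y),
                         (x + size, y + size), (x, y + size), (x, y)])
    return polygons

# three workload sizes for a controlled scaling experiment
workloads = {n: synthetic_polygons(n) for n in (1000, 10000, 100000)}
```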

2.7 Performance resources

Before creating a performance benchmark for the spatial team, resources and workload have to be identified. The resources of the spatial team are, of course, dependent on the total IT infrastructure of Statistics Netherlands. The following sections will first deal with the general description of important resources in literature before describing the IT infrastructure and the resources of the spatial team within that infrastructure. A resource can be viewed as a “system element that offers services required by other system elements”. Resource and workload interaction results in performance, making the resources a very important set of parameters (Woodside et al., 2007). Resources include:

 Hardware (disk, CPU, bus, I/O and storage, memory, logical resources, network)
 Processing resources (processes, threads)

This section will not provide an exhaustive description and definition of all resource components that are listed, but will provide an overview of the most important parts and clarify their relationship to performance. For example, if the I/O time has to be reduced, data that are related (spatially) should be stored within one memory unit as much as possible or have the same entry within an index.

2.7.1 I/O

I/O stands for “input/output”, referring to retrieving data from and writing data to the (physical) disk, network or another device (monitor, movable storage, printer). The disk I/O process is the slowest part of a computer operation and the development of I/O speed has not kept pace with the development of CPU speed. According to Hennessy & Patterson (2007), I/O has been severely neglected in performance optimization. Therefore, reducing I/O to and from disk can improve performance significantly.

2.7.2 Disk

Different drive or disk technologies have emerged to improve performance and to save power. The “traditional” disk technology, the HDD (Hard Disk Drive), is still used very often. How does the (HDD) disk access its data? The operating system directs the disk through the process, which mainly consists of the following stages (University, 2005):

 Seek time (ST) – the disk head has to move to the right track on the disk
 Rotational Latency/Delay (RT) – the disk has to rotate to position the right sector under the head
 Transfer time (TT) – the block of bits has to be transferred, e.g. to memory

These stages are managed by the disk controller. The disk access time is determined by:

Disk access time = seek time + rotational delay + transfer time + controller overhead
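As a worked example of this formula, the average access time of a hypothetical 7200 RPM HDD can be estimated as follows (all values are assumptions for illustration, not measurements on SN hardware):

```python
seek_time_ms = 9.0                               # average seek time (assumed)
rotational_delay_ms = (60000.0 / 7200) / 2       # half a rotation at 7200 RPM
transfer_time_ms = (64.0 / 1024) / 100 * 1000    # 64 KB block at 100 MB/s
controller_overhead_ms = 0.2                     # assumed

disk_access_ms = (seek_time_ms + rotational_delay_ms
                  + transfer_time_ms + controller_overhead_ms)
print("estimated average disk access: %.1f ms" % disk_access_ms)  # ~14 ms
```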


In ESRI technical documentation (ESRI, unknown) it is stated that ArcGIS systems are usually CPU bound, but that systems with high-volume requests written to a single output are probably more I/O bound. This would imply the use of faster disks, e.g. an SSD or a RAM disk, to improve performance. An SSD, unlike the HDD, has no moving mechanical components. It has an integrated circuit, which results in faster access to data on any location on the disk.

2.7.3 CPU (Central Processing Unit)

Computer calculations are CPU bound if the “rate at which [the] process progresses is limited by the speed of the CPU” (Stackoverflow, 2009). The most important properties of a CPU to take into account are processor type, clock speed (measured in GHz, relating to the number of cycles per second), cache size and number of cores (BBC, 2015). Simion et al. (2012) have researched the behaviour of spatial calculations regarding their resource utilization. A remarkable result is that 45% of the execution time, on average, is spent on CPU stalls. This happens due to a gap between the CPU speed and the speeds of the different levels of memory. The growth of data and of analysis and data complexity could lead to an even wider gap.

2.7.4 GPU (Graphics Processing Unit)

The GPU was originally developed to accelerate graphics. The acceleration is achieved by using thousands of parallel cores. This mechanism is used more and more for heavy workloads, like geoprocessing on a large scale. There have been several examples of research into the application of the GPU, which will be further described in Chapter 8: Trends in geoprocessing.

2.7.5 Memory

Memory is constructed as a hierarchy, moving from high performance but small space down to lower performance but large space for storage. This is illustrated in Figure 9:

Figure 9: Memory Hierarchy (Teachbook, 2012)

The process of fetching data from and storing data in different memory levels can exhibit a number of pitfalls. For example, if requested data are not found in the cache by the processor, a cache miss occurs. The process involving the requested data is stalled temporarily (Hennessy & Patterson, 2007). The same can happen for retrieval of data in other memory components. A pitfall that can result in considerable overhead is a “page fault”. A page fault can occur if a page that is supposed to be accessed from a certain place in memory is actually at another memory location (soft page fault) or even in the virtual memory on disk (hard page fault). As expected, especially hard page faults can affect performance negatively. A pitfall of another category is a memory leak. These leaks occur if a process keeps hold of memory space it no longer needs, which can lead to severe performance deterioration. According to ArcGIS Desktop users at Statistics Netherlands, but also on diverse social networking platforms, this problem occurs regularly in Desktop and the ArcPy library (Stackexchange, 2011-2014).

Cache

Cache is memory space where data are stored temporarily for quick access during computer calculations. Caches can be found at all memory storage units of the computer: hard drives, hard drive controllers, network cards and especially the processor: the CPU contains 3 different levels of cache memory, as also shown in Figure 9. There is an important factor for the benchmark testing procedures related to cache: is the tool run directly after starting the application, so that no data are available in the cache (cold), or has the application already been working with the same data, leading to available data in the cache (warm)? Ray et al. (2011) include a warm-up run. Mostly, a combination of both is observed.
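In the benchmark scripts, this distinction could be handled with an explicit warm-up run before the timed runs, roughly as sketched below; run_tool is a hypothetical callable that wraps a single geoprocessing tool.

```python
import time

def benchmark(run_tool, runs=5, warmup=1):
    for _ in range(warmup):                    # warm-up: fills the caches first
        run_tool()
    timings = []
    for _ in range(runs):                      # timed runs on a warm cache
        start = time.perf_counter()            # time.clock() on Python 2.7
        run_tool()
        timings.append(time.perf_counter() - start)
    return timings
```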

RAM


Random Access Memory is stored physically in special DRAM modules in the computer and is used to access and store data in a random way instead of sequentially; it is therefore suitable for temporary storage. Insufficient RAM can have a negative effect on performance. Geoprocessing performs at its best if carried out within the RAM, the physical memory (Pardy & Hartling, 2013). Therefore, for processing of large datasets, RAM/virtual memory is viewed as instrumental for good performance of overlay geoprocessing. Factors that reduce available memory, such as other application processes or, in the case of overlay tools, multiprocessing, will negatively affect performance (Hartling, 2012). The limits of RAM are dependent on hardware and operating system.

Virtual memory

Virtual memory is space on the hard disk that acts as main memory for data that should be in memory but does not fit; it is therefore rather a memory management method. Virtual memory can be larger than the physical memory (Bell, 2013). With this “workaround”, the limits of physical memory can be dealt with and I/O activity for memory swapping to and from RAM can be reduced (Bell, 2013). Still, the I/O from and to the hard disk is very slow compared to processor memory, as shown in Figure 9.

2.7.6 Operating System

Except for ArcGIS Pro, all ESRI software has been designed for 32-bit, which means that the software is not able to profit from 64-bit resources when installed on 64-bit machines. Since ArcGIS 10.1, Service Pack 1, however, Large Address Aware and 64-bit background processing have enabled the software to profit from 64-bit memory. Large Address Aware (LAA) is already installed with ArcGIS Desktop 10.1 and makes an application handle addresses larger than 2 GB. 64-bit processing is available for background (off-screen) processing via an additional installation on top of ArcGIS 10.1, Service Pack 1. Unfortunately, this installation is not available at SN. According to ESRI, 64-bit background processing does not directly solve performance problems, but is more scalable for processing large datasets. Outside ArcGIS, 64-bit processing can be applied with ArcObjects or Python (Pardy & Hartling, 2013). The 32-bit version of Windows can access up to 3 GB of physical memory with the user memory setting increased (Hartling, 2012). With 64-bit Windows, up to 192 GB of physical memory can be accessed (Hartling, 2012). A 32-bit application, for example ArcGIS Desktop, can access almost twice as much memory when run on the 64-bit operating system (due to being Large Address Aware). What needs to be considered when running the geoprocessing tools via a Python script is whether the 32-bit version of Python has been installed.
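The bitness of the installed Python interpreter can be verified directly from Python itself, for example:

```python
import struct
import sys

bits = struct.calcsize("P") * 8     # size of a C pointer in bits
print("%d-bit Python (sys.maxsize = %d)" % (bits, sys.maxsize))
# A 32-bit interpreter cannot address more than 2 GB of memory
# (up to around 4 GB with Large Address Aware on a 64-bit OS).
```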

2.8 Metrics

Performance benchmarking requires “performance measurement”. For this purpose, an abundance of metrics is available, related to time, quality and resource usage. It is important to choose the metric, or combination of metrics, that is understandable (in order to be interpreted correctly), implementable (the tools to measure have to be available) and provides the clearest view on performance. Which metrics to choose depends on the perspective (the end user’s perspective or that of the system analyst/performance manager). A well-known challenge of performance measurement is the quality of the measurement: time, for example, is a continuous variable (Wilson, 2010). Consequently, some kind of error will be included during its measurement and analysis. Therefore, a test scenario has to be run several times to allow statistical analysis of the measurements. Measuring performance of geoprocessing tools is especially challenging, compared to, for example, measuring performance of order transactions.

2.8.1 Time

Time is still the most common interval measure in performance testing (Wilson, 2010). It can be the metric of different processes: the execution time, also called “wall clock time”, is actually the time most users are interested in. Wall clock time is in fact the time span that also includes other time-related processes like I/O waiting time or operating system overhead. As a rule of thumb, the following time spans are included in wall clock time/execution time: CPU time, I/O time, interpreter start-up time and bytecode compilation time (Stackoverflow, 2012). Execution or wall clock time is used in many literature examples, e.g. in Simion et al. (2012) or Abdelguerfi et al. (2005), where it is even used as the only metric.
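The difference between wall clock time and pure CPU time can be made explicit in a measurement script. A sketch is given below, using the Python 3 timer names; on the Python 2.7 that ships with ArcGIS 10.1, time.time() and time.clock() would be the closest equivalents.

```python
import time

start_wall = time.perf_counter()    # wall clock: includes I/O waits, overhead
start_cpu = time.process_time()     # CPU time consumed by this process only

# ... run the geoprocessing tool here ...

wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print("wall: %.2f s, CPU: %.2f s, I/O and other waits: %.2f s"
      % (wall, cpu, wall - cpu))
```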

2.8.2 CPU

New technology has improved CPU performance but also complicated measuring CPU use, because of the possibilities of multiple cores per processor, hyper-threading, shared caches and Power Performance Management that enables dynamic adjustment of CPU capacity depending on the current workload of the system (Performance Team, 2009). Different metrics can be used for CPU: the performance is often indicated in FLOPS (floating point operations per second) or MIPS (millions of instructions per second), whereas the usage is measured in percentage or total time (or time per CPU or core). CPU is often measured as CPU usage in %, which basically means the percentage of time used for processing out of the total CPU capacity (Rodola, 2014). The time that a process has spent on the CPU is also measured in seconds, subdivided into at least two phases: “user” is the amount of CPU time that is spent on actually executing the process, “system” is the time spent in the kernel (Stackoverflow, 2012), e.g. to manage I/O tasks.
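The psutil library of Rodola (2014) exposes both kinds of CPU metric; a minimal sketch of sampling them around a tool run:

```python
import psutil

proc = psutil.Process()             # handle to the current (benchmark) process

# ... run the geoprocessing tool here ...

cpu = proc.cpu_times()
print("user: %.1f s, system: %.1f s" % (cpu.user, cpu.system))
print("system-wide CPU usage: %.1f %%" % psutil.cpu_percent(interval=1.0))
```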

2.8.3 I/O

The metrics used for I/O are the number of read and write processes as well as the memory size of read and write processes. I/O count as a metric is, for example, used by Arge, Hinrichs et al. (2002). Another I/O metric is I/O time.
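With the same library, I/O count and I/O size can be sampled before and after a tool run; a sketch (io_counters() is available on the Windows fat clients described in chapter 3):

```python
import psutil

proc = psutil.Process()
before = proc.io_counters()

# ... run the geoprocessing tool here ...

after = proc.io_counters()
print("reads: %d, writes: %d, MB read: %.1f, MB written: %.1f" % (
    after.read_count - before.read_count,
    after.write_count - before.write_count,
    (after.read_bytes - before.read_bytes) / 2.0 ** 20,
    (after.write_bytes - before.write_bytes) / 2.0 ** 20))
```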

2.8.4 Memory

As memory is available on many levels (disk, RAM, virtual, CPU) and is managed by the operating system in order to fulfil the tasks required, a number of different memory-related metrics exist. A number of them apply to all memory levels (RAM, virtual, swap):

a) total available memory (e.g. in bytes)
b) amount or percentage of used memory
c) amount or percentage of free memory
d) amount or percentage of memory read or written

Another type of memory-related metric describes the process of the operating system assigning pieces of memory outside RAM to a certain process (paging). These metrics can express the size, but also the number of so-called “page faults”: events where the address of the data in RAM cannot be retrieved and has to be searched for in the virtual memory.
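A sketch of how metrics a) to c) and the page fault count could be sampled, assuming the Windows build of psutil, which exposes a page fault counter:

```python
import psutil

vm = psutil.virtual_memory()
print("total RAM: %d MB, used: %.1f %%, free: %d MB"
      % (vm.total / 2 ** 20, vm.percent, vm.available / 2 ** 20))

mem = psutil.Process().memory_info()
print("working set of this process: %d MB" % (mem.rss / 2 ** 20))
# On Windows builds of psutil the same record also carries the cumulative
# page fault count of the process (mem.num_page_faults).
```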

2.8.5 Number of observations and statistics of observed metrics

One observation of a given metric of a test scenario is often insufficient to draw conclusions, since the influence of system resources can vary per run. In most literature examples, more than one test run is used as observation. Tijssen, Quak, and van Oosterom (2012) use 5 test runs, whereas Ray et al. (2011) implement 3 iterations. To evaluate the metrics of a series of observations, a number of statistical values can be calculated. The mean or median is often used as a statistical measure. The suitability of the mean depends on the measured values: if outliers are observed, it is less advisable because they can lead to a misleading mean value. Still, it is used in a number of examples: the spatial benchmarks Jackpine (Ray et al., 2011) and VESPA (Paton et al., 2000). The benchmark of Tijssen et al. (2012) uses the mean of the remaining runs out of 5 test runs (after removing the fastest and slowest response time). The median is the middle value in an ordered set of values when there is an odd number of values, and the average of the middle two values when there is an even number of values. The median is less sensitive to extreme values than the mean and is usually a better measure of location than the mean for asymmetric distributions. The mode is the most frequent value within a test set, but is not used very often (Wilson, 2010).
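These statistics can be computed directly from a series of observations. Below is a small sketch of the trimmed approach of Tijssen et al. (2012) next to the plain mean and median; the timing values are invented for illustration.

```python
timings = [41.2, 39.8, 40.1, 55.3, 40.5]   # five hypothetical wall clock times (s)

ordered = sorted(timings)
mean = sum(ordered) / len(ordered)         # pulled upwards by the 55.3 outlier
median = ordered[len(ordered) // 2]        # middle value of the five runs
trimmed = ordered[1:-1]                    # drop the fastest and slowest run
trimmed_mean = sum(trimmed) / len(trimmed)

print("mean: %.1f s, median: %.1f s, trimmed mean: %.1f s"
      % (mean, median, trimmed_mean))      # 43.4 / 40.5 / 40.6
```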


Chapter 3: Performance Bottlenecks at Statistics Netherlands

This chapter is dedicated to research question 1:

Which technical as well as organizational performance bottlenecks can be identified during geo-information processes at Statistics Netherlands and how are they currently dealt with?

To answer this question, a number of issues have to be covered: the GIS applications that are used (section 3.1), an analysis of the workload of Spatial Statistics (section 3.2) and an analysis of the IT resources of Statistics Netherlands and Spatial Statistics (section 3.3). Based on this information, a number of technical and organizational performance bottlenecks can be deduced. They are presented in section 3.4.

3.1 GIS applications

The focus of the research is on the performance of geoprocessing at team Spatial Statistics in ArcInfo Workstation versus ArcGIS Desktop. However, the final result of this research needs to lead to conclusions and recommendations that fit within the working processes of the entire organization. Therefore, a short analysis of the GIS application landscape will be provided in this section.

Examining that application landscape more closely reveals many different applications, each with its own user group (if known) and type of use. The use of IT resources also varies between use via thin or fat clients (see also 3.3.2 IT resources of Statistics Netherlands). A number of GIS-related working processes are based on old software. More recent products like ArcGIS Desktop 10.1 and R Spatial have not yet found their way as main applications within the working processes. ArcGIS Desktop 10.1 is used only for new or custom products and the BBG, because the rest of the standard products of Spatial Statistics are based on ArcInfo Workstation. R Spatial is used by a team that deals with statistical methods and had been working with the statistical language R for a while before starting to use its spatial extension. At the moment, this team is using R Spatial to map traffic intensities based on data captured by sensors, which are large, spatio-temporal datasets.

A survey has been held to assess the use of ArcInfo Workstation, ArcView and PX-Map. Table 2: Overview of GIS applications at Statistics Netherlands is based on the resulting report and on communication with staff members of Spatial Statistics. The table also shows that two products are not from ESRI: PX-Map and R Spatial. While PX-Map has been included in the survey conducted by Spatial Statistics, the use of R Spatial has not been measured. Information on that product and its functionality and performance has not been shared with other teams in a structured way. As explained in the thesis plan, the majority of the Spatial Statistics products are produced with ArcInfo Workstation, which is easy to use only for the GIS staff members who are very familiar with it. The version used on the new fat clients is ArcInfo Workstation 10.0, which is the last planned release, issued together with ArcGIS Desktop 10.0. The ArcInfo Workstation product life cycle (ESRI, 2014) states that ArcInfo Workstation can still be used together with higher ArcGIS Desktop versions, although support will be retired as of 31.12.2015.

Among the current ArcInfo Workstation users at SN, there is one staff member with expert knowledge of ArcInfo Workstation and AML, and 3-4 users with high knowledge. These users conduct extensive geoprocessing activities with large datasets, based on AML scripts. The AML scripts are used for the products: Financiële Verhoudingswet, Nabijheidstatistieken (Proximity Statistics), Wijk- en Buurtkaart (Municipality, District and Neighbourhood statistics), Vierkantstatistieken (Grid-based statistics) and Bevolkingskernen (Urban agglomerations). While users are satisfied with the performance and functionality of this application, some datasets, especially the key registers, are difficult to process because of the limitations of the coverage data format. The BAG (Key Register of Addresses and Buildings), for example, has to be subdivided before processing.


GIS system | Programming/scripting language | Sectors/teams | Thin/fat client | Use | Number of users
ArcInfo Workstation | AML | Environmental, Energy and Spatial Statistics (SLO): Spatial Statistics | Fat | Extensive geoprocessing | 4
ArcView 3.x | Avenue | SLO, SAL, SDI, SES | 2 fat, rest thin | Data visualization, simple analysis, for a small part extensive geoprocessing | 17
ArcGIS Desktop 9.3/9.3.1 (through December 2014, replaced by ArcGIS 10.2.1 in 2015) | Python, VBA | Throughout whole organization | Thin | Unknown | App. 30
ArcGIS Desktop 10.1 | Python, ArcObjects | Environmental, Energy and Spatial Statistics (SLO): Spatial Statistics | Fat | Data editing, extensive geoprocessing | 11
PX-Map | Developed in C#, .Net and JavaScript, but no scripting language to extend the application | SLO, BIM, EBD, EBH, SAL, SDI, SES, Quaternary sector (?) statistics | Thin | Data visualization | 20
R Spatial | R | Process Development and Methodology (PPM) | Unknown | Statistical analysis of spatial data | Unknown

Table 2: Overview of GIS applications at Statistics Netherlands

As shown in Table 2, there are 17 users of ArcView at different sectors of SN. More than 50% use ArcView monthly or even more often, mainly to visualize spatial data, for cartography, for spatial checks of data or to join tabular data to the map. More intensive use, e.g. a spatial join or more extensive editing of maps, is only done by 2 users. A majority of the users are satisfied with ArcView, including the customization that has been developed with the scripting language Avenue (Goedhuys, 2014). A number of production processes are carried out with ArcView: creation of shapefiles of districts, neighbourhoods and provinces, delivery of x and y coordinates of different boundary files for the Statline table, delivery of maps for PX-Map, distances to nature areas and the visualisation of safety per administrative unit (Goedhuys, 2014).


The use of ArcGIS Desktop 9.3/9.3.1 at SN is not documented yet. At the start of the research, ArcGIS 9.3 was used for the same purpose as ArcView, via the virtual desktop. There are no reported problems concerning functionality and performance, although one has to consider that this application has not been included in the survey. Since 2015, the ArcView 9.3 licenses have been replaced with ArcView 10.2.2 licenses, but at this stage it is too early to report on user experiences.

The new fat clients are provided with ArcGIS Desktop 10.1. ArcObjects is the development environment of ArcGIS, containing the functionality that can be extended at a low level in programming languages such as C#. However, the current scripting language is Python (via the ArcGIS Python library ArcPy), so most of the new custom products made by the GIS staff have been developed with Python. Since ArcGIS Desktop 10, more possibilities to enhance performance are provided, such as support for multiprocessing as well as background 64-bit processing (ESRI, 2013). Background 64-bit processing has to be installed separately on top of the 10.1 installation. These possibilities are known at Statistics Netherlands, but the time to experiment with this functionality has been too limited. Three users use ArcGIS 10.1 for data editing and light analysis; 3-4 staff members have a very high proficiency in ArcGIS Desktop (geo)processing tooling and Python scripting and use the application for extensive geoprocessing. It should be noted that Python is a runtime-interpreted scripting language as well.

PX-Map is a map visualisation application developed for Statistics Norway by Geodata AS. Light, inexperienced users can easily visualize CSV data on a map. The user manual of PX-Map explicitly states that it is not a tool for analysing statistical data; it is to be used only as a helping tool to present statistical data as a thematic map (Statistics, 2009). PX-Map is used by more sectors than ArcView, by 20 users, although its use is much less frequent: only 10% use it monthly or more often, whereas 90% use it a few times per year or even less. The user satisfaction is less positive as well: more than 40% are not satisfied.

R can be used for mathematical and statistical analysis of spatial data (such as, for example, kriging) as well as GIS tasks like buffer analysis and overlay operations. The use of R Spatial at Statistics Netherlands has not been documented (or the documentation is not available at the moment); relevant information on the frequency and type of use still has to be retrieved.

3.2 Workload Analysis

While the previous sections have addressed the total view of GIS at Statistics Netherlands and the role of Spatial Statistics, this section will focus more closely on Spatial Statistics and the workload that makes use of the technical resources. Workload is a specific term used in performance engineering; a small overview of workload analysis in literature has been presented in chapter 2. The products that have been described in chapter 1 are delivered after a series of geoprocessing steps, executed from an AML script or, in some cases, a Python script. Figure 10 shows an excerpt of a Python script to determine the proximity of park entrances via foot and/or cycle paths. It contains a number of selections on attributes via FEATURE CLASS TO FEATURE CLASS, buffer analysis, aggregation of features via DISSOLVE, overlay operations such as INTERSECT and UNION and transformations of geometry type such as POLYGON to LINE.


Figure 10: Example of combination of processing tools in script (Zuurmond, 2013)

Performance benchmarking of a complete geoprocessing script is not very reliable, because the methodology of the script will be subject to change as well, leaving the single geoprocessing tool as the smallest and clearest unit for the benchmark, similar to the “micro benchmark workload” applied by Ray et al. (2011). As shown in Figure 10, a product of the spatial team is the result of a chain of processing tools, which does not only contain geoprocessing tools, but also a number of administrative processing steps on the tabular data, which serve as preparatory steps and account for at least 50% of the processing tools. At the beginning of the production process, the input data have to be prepared for use, e.g. by discarding attributes that are not used at SN, or by geocoding addresses by joining the residential objects with the addresses. A brainstorm session has been held with the “heavy GIS users” to discuss the tools that have to be included. The selected tools have been prioritized for performance evaluation for various reasons: the tools are often used, performance is problematic in ArcGIS Desktop, performance is problematic in ArcInfo Workstation, or stability is lower in either Desktop or Workstation. The following tools have been prioritized by the heavy GIS users as a first prioritization step:

1. Geoprocessing tools
   a. UNION: instability in ArcInfo Workstation with large datasets, but also slower performance in ArcGIS Desktop; frequently used
   b. DISSOLVE: frequently used, slower performance in ArcGIS Desktop
2. Administrative processing tools
   a. ADMINISTRATIVE JOIN: used many times, big difference in performance noticed between ArcInfo Workstation and ArcGIS Desktop 10.1 (20 minutes in Workstation, 3 hours in Desktop)
   b. UPDATE CURSOR: long execution time, big difference between performance on the Windows XP fat client and the Windows 7 fat client (Windows 7 performance is slower)
   c. SUMMARY STATISTICS: frequently used tool
   d. FREQUENCY: frequently used tool

As a next step, tests with Workstation have been conducted, also to test the difference before and after the migration of the fat clients to Windows 7 (including the migration from Workstation 9 to 10). The test results will be presented in chapter 5. To keep the focus of the research on geoprocessing, the INTERSECT will be added to the workload. Additionally, tools that have not been mentioned during the discussion but play an important role within the transition from Workstation to Desktop are the NEAR and the OD Matrix, used for the proximity statistics. The NEAR will therefore be included in the benchmark workload. As a result, the definitive set of tools for the benchmark will be as follows:

1. Geoprocessing tools
   a. UNION
   b. DISSOLVE
   c. INTERSECT
   d. NEAR
2. Administrative processing
   a. ADMINISTRATIVE JOIN (JOIN FIELD)
   b. SUMMARY STATISTICS
   c. FREQUENCY

Given the focus on geoprocessing, the administrative tools will not be tested with each factor (only in default settings and with a different hardware configuration); the geoprocessing tools will remain the focus of the benchmark to be developed.
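On the ArcGIS Desktop side, a single tool run can be timed by wrapping it in a small ArcPy script, roughly as sketched below; the paths and dataset names are placeholders, not the actual SN data.

```python
import time
import arcpy

arcpy.env.overwriteOutput = True

start = time.time()                    # wall clock around exactly one tool
arcpy.Union_analysis(["C:/data/bbg2008", "C:/data/bbg2010"],
                     "C:/data/scratch.gdb/bbg_union")
print("UNION finished in %.1f s" % (time.time() - start))
```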

3.2.1 Tools

3.2.1.1 INTERSECT

This tool calculates the geometric intersection of the input features. These features can be of different geometry types. The output dataset will contain only features or portions of features that overlap in all layers and/or feature classes, as shown in Figure 11: the overlay of the two datasets (one containing 3 objects, the other containing 2) results in an output dataset containing 2 objects. Less time is expected to be spent on I/O activity for the output geometry, contrary to the UNION, which has to calculate and write output geometry that includes the new geometry derived from the intersection of both input geometries as well as the geometry that does not intersect. Before starting the actual intersection process, the tool first determines the spatial reference. Then it has to crack the features (inserting vertices at the intersections of feature edges) and cluster them (snapping together vertices that are within the x and y tolerance). During the next step, it analyses the geometric relationships between all feature classes and writes the new features to the output (ESRI, 2013e).


Figure 11: INTERSECT input and output features (ESRI, 2013e)

The usage of resources will probably be similar to that of the UNION, as it is a comparable tool, yet the output feature class will be smaller in records and memory; therefore one could assume that less I/O capacity is needed for writing the result to the output.

3.2.1.2 UNION

The UNION is a frequently used geoprocessing tool that takes two input feature classes or feature layers and computes the geometric union of all input features and their attributes. Running the UNION with one input dataset is also possible, if overlap between the features within this dataset has to be detected. The resulting output feature class will show polygons that represent the geometric union. Figure 12 shows the 3 objects of the input, resulting in 7 output objects derived from the overlay process. Before starting the actual union, the tool first determines the spatial reference. Then it has to crack the features (inserting vertices at the intersections of feature edges) and cluster them (snapping together vertices that are within the x and y tolerance). During the next step, it analyses the topological relationships between all feature classes and writes the new features to the output (ESRI, 2013i). Spatial indexes can be added to input datasets to improve performance. At Spatial Statistics, the default spatial indexes are used.

Figure 12: UNION input and output features (ESRI, 2013i)

A UNION is often calculated with the Land Use File (BBG) for two different years, to easily identify changes between the two datasets. This type of historical analysis is also applied to other polygon datasets. Similar to the INTERSECT, the tool did not work satisfactorily with the BBG as a coverage file during the baseline tests: an error message with the phrase “Too many files-FILEGET” was rendered. UNION is a tool that will probably use a lot of I/O (to retrieve the files as well as the indexes), but also memory to load the files, as well as CPU to calculate the geometry of the input files.

3.2.1.3 DISSOLVE

This tool is used to aggregate adjacent geometric features, often based on one or more attributes. As a next step, the aggregated attributes are often summarized with different statistics. Features with the same value combinations for the specified fields will be aggregated (dissolved) into a single feature or a multi-part feature. It is also possible to dissolve without an attribute (dissolve field), e.g. to dissolve overlapping features (ESRI, 2013c). The steps that DISSOLVE has to take are roughly: selection of records with the same attribute value (if this option is used), identification of neighbouring polygons within the resulting subset of records, removal of vertices of the shared boundaries, a topology check of the resulting polygons and storage of the new polygon set. The identification of neighbouring polygons is a step that can be expected to be easier with explicit topology, as applied in the coverage data format of ArcInfo Workstation.

Figure 13: DISSOLVE input and output features (ESRI, 2013c)

3.2.1.4 NEAR

The NEAR is a proximity tool. With this tool, the shortest distance between each feature (point, line or polygon) and the nearest feature (point, line or polygon) is calculated within the search radius, according to a set of rules. It determines the distance from each feature in the input features to the nearest feature in the second dataset, within the search radius (ESRI, 2013f). The table of the input dataset will be updated with the feature ID of the nearest feature and the distance from the input feature to the nearest feature.

Figure 14: Different NEAR options (ESRI, 2013f)


At SN, this tool is stable only in ArcInfo Workstation, using a dataset on a national scale. The proximity calculation will put a high load on the CPU, whereas the updating of the table will be I/O intensive. Loading the input and near data will also be memory intensive.

3.2.1.5 ADMINISTRATIVE JOIN

The administrative JOIN is applied with the AML command “JOIN ITEM” or the Python command “JOIN FIELD”. As explained before, this tool is a very important, frequently used preparatory step that adds attributes of a related table to the source table, enabling further administrative or spatial analysis. Even though it is not a geoprocessing tool, the JOIN is an operation that costs a lot of time for large datasets. The source dataset keeps all attribute fields; the fields from the join table can be selected. JOIN FIELD is preferably used for a 1:1, 1:n or n:1 relationship. JOIN FIELD is often preceded by a SUMMARY STATISTICS (in case of a 1:n relationship, a selection can be made based on the field statistics). To improve performance, an index is often added based on the join field of the join table. Most probably, more I/O capacity is needed in comparison with CPU capacity, because the index (if present) and all table rows of the join table have to be read and coupled to the corresponding field of the source dataset. Another possibility is the ADD JOIN, which at first creates a virtual join of the tables. A step to create a real table is then required, for example with COPY ROWS/COPY FEATURES or TABLE TO TABLE/FEATURE CLASS TO FEATURE CLASS. The experiences at Statistics Netherlands with this method have been mixed: faster performance has been noticed, but less stability.
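A sketch of the JOIN FIELD pattern described here, including the attribute index that is often added first; all dataset and field names are invented for the example.

```python
import arcpy

# index the join field of the join table to speed up the lookups
arcpy.AddIndex_management("C:/data/bag.gdb/house_numbers",
                          "vbo_id", "idx_vbo_id")

# add selected fields of the join table to the source dataset
arcpy.JoinField_management("C:/data/bag.gdb/residential_objects", "vbo_id",
                           "C:/data/bag.gdb/house_numbers", "vbo_id",
                           ["postcode", "huisnummer"])
```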

3.2.1.6 SUMMARY STATISTICS and FREQUENCY

These tools are part of the “Statistics” toolset. The SUMMARY STATISTICS is available in all licenses, whereas the FREQUENCY is available only in the advanced license. The SUMMARY STATISTICS calculates a statistical value (e.g. sum, mean, maximum, count) over a specified field, whereas a FREQUENCY counts the occurrences of the values of a specified field and stores the result in a separate output table (ESRI, 2013h; ESRI, 2013d).

3.2.2 Data

The geoprocessing activities are carried out on vector datasets, largely on a national scale. Some datasets are processed per map sheet or per smaller administrative unit, e.g. province. Raster data like imagery are mostly used as a reference layer. The BRT and imagery are delivered in map sheets. To run a number of tools, such as the NEAR, the data have to be processed in smaller units, for example at the municipality level instead of the national level. In the following sections, the most used datasets will be described in short; their metadata will be provided in Appendix 2: Description of selection of spatial datasets.

3.2.2.1 Basis Registratie Adressen en Gebouwen (BAG)

The BAG (the key register of addresses and buildings) contains the following objects: gebouwen (buildings; polygon geometry), verblijfsobjecten (residential objects; point geometry, has to be located within one or more buildings), ligplaatsen (anchorages; polygon geometry), standplaatsen (pitches; polygon geometry), nummeraanduidingen (house numbers; no geometry), openbare ruimte (streets; no geometry) and woonplaats (residence; polygon geometry). The polygon geometry of pitches and anchorages is turned into points. The geometric features of the BAG are delivered by the Kadaster in the open format GML (Geography Markup Language) and converted to file geodatabase at SN, to be ready for use in ArcGIS. The address objects are administrative, but are very often joined to the spatial tables. It is a very heavy dataset, varying per year from 8 to 10 GB. The tables contain many attributes, approximately 15 per object. Often, the tables are stripped of unnecessary attributes to improve performance. Processing steps that often occur with these datasets are:

 FEATURE TO FEATURE/TABLE TO TABLE (e.g. to remove or rename attributes)
 FEATURE TO POINT to create centroids of anchorages and pitches (before adding them to the residential objects)
 JOIN FIELD – administrative join of house numbers to a combined table of residential objects, pitches and anchorages to obtain addresses with coordinates
 SPATIAL JOIN of residential objects per building
 DISSOLVE of building polygons

A practical example of a processing flow is the identification of addresses within a block of buildings (that use block heating) and the creation of an overview of those building blocks per district or neighbourhood:

 FEATURE TO POINT to create centroids of pitches and anchorages
 FEATURE TO FEATURE of the residential objects to remove/rename fields and to keep the original dataset intact
 MERGE or APPEND of the centroids of pitches and anchorages with the residential objects (creating a combined dataset, called the VSL (Verblijfsobjecten, Sta- en Ligplaatsen))
 JOIN FIELD between the VSL and the house numbers
 SPATIAL JOIN with the district and neighbourhood boundaries

3.2.2.2 Nationaal Wegenbestand (NWB)

The NWB is a digital geographic dataset at a scale of 1:10,000 that contains almost all roads in the Netherlands: the roads that are maintained by the national public authorities, provinces, municipalities and water boards. The NWB forms the network used for the proximity analyses of SN. The NWB is stored as a coverage, which implies that the proximity analyses are conducted with an AML script. It also means that topological information regarding the network is stored explicitly in the dataset. Statistics Netherlands always prepares the dataset by correcting the drive directions and connecting orphans to the network. The proximity statistics are calculated via a NEAR from the address coordinates to projection points on the network and an Origin Destination (OD) matrix via the network from the projection points to facilities.

3.2.2.3 Bestand Bodemgebruik (BBG)

The BBG is produced by SN, but also re-used by SN and a large number of partners. It is a dataset derived from imagery, where areas are assigned codes depicting a certain type of land use. A processing step that often occurs with this dataset is:

 A UNION of the BBG from two different years to detect changes or mistakes

3.2.2.4 Wijk- en buurtkaart (district and neighbourhood boundaries)

This dataset contains the boundaries of districts and neighbourhoods within a municipality and is used to produce statistics for small areas. Processing steps that often occur with this dataset are:

 A DISSOLVE on neighbourhood, district and municipality borders
 Combination with statistics per administrative level

3.3 Performance resources

3.3.1 Location of GIS elements within the Statistics Netherlands infrastructure

Within the general IT infrastructure of Statistics Netherlands, the spatial team conducts its geoprocessing activities on the local drives of their fat clients, but still has to connect with the data centre in Almere to access the ArcGIS license server and the user profile. The datasets are copied from the network to the local drive. The user profile is stored and backed up within the CIFS4 storage pool. The executable of ArcGIS is installed on the fat clients via SCCM (Schets & Desabandu, 2014). SCCM stands for System Centre Configuration Manager, a tool that is used for tasks such as patch management and software distribution (University, unknown). This is visualized in Figure 15. It has to be noted that this figure only shows the elements used by the ArcGIS users. The resources and workload of R Spatial have not been investigated in this research project.

Figure 15: ArcGIS elements within IT infrastructure SN

Figure 15 shows that the fat clients have to use a relatively slow connection of 100 MB/s within the office LAN and a “medium” connection of 1 GB/s over the WAN to connect with the virtual servers of the VDI storage pool to access the license server, or with the CIFS servers to access the scripts and copy data from the network drives. The license server has to check whether the user possesses the necessary licenses to execute a certain tool (e.g. the extensions Spatial Analyst or 3D Analyst). The script is accessed and interpreted during the process, meaning that an intermediary language is needed to translate the code into machine-understandable language.

It is not certain how often code lines are read via the LAN and WAN connections during the run of a geoprocessing tool and thus how strongly this influences performance. This could be included as a factor in the benchmark by comparing the performance of a geoprocessing tool with the script stored locally and on the network. However, it should also be noted that the network bandwidth will be upgraded during the course of 2015.

The user profile is additionally accessed via the LAN and WAN connections, as it is stored within the CIFS virtual server. The support page of ESRI states that “ArcGIS for Desktop relies heavily on the Windows User Profile”, which means that a new profile has to be created once it gets corrupted. If the “My documents” folder is redirected to the data centre, this could cause performance problems for ArcGIS Desktop 10.x, because files such as the default file geodatabase are stored in that folder. Local storage would be advisable, according to Esri (ESRI, 2012).

4 Common Internet File System (CIFS) is a protocol that programs use to make requests for files and services on remote computers on the Internet (Rouse, Margaret, 2005).

3.3.2 IT resources of Statistics Netherlands

Although the GIS staff of Statistics Netherlands conducts most of its geoprocessing activities on the local drive, some elements are linked to the network. Therefore, a description of the total IT infrastructure of SN is useful to understand how the geoprocessing activities interact with the rest of the infrastructure and to identify bottlenecks that affect the working processes of the spatial team. Moreover, knowing the background of the current situation at the spatial team, such as the introduction of virtual desktops as well as the re-introduction of the fat clients, helps to evaluate possible solutions towards a sustainable migration of the GIS. Enterprise IT infrastructures of large organizations are often complex and difficult to maintain. This is certainly the case for Statistics Netherlands: it needs an IT infrastructure that can handle a number of different challenges: storage of very large databases, very heavy use of complicated queries on different enterprise databases and a high level of security to protect sensitive microdata. The amount of data storage is extremely high, at 1 petabyte (1000 terabytes), as is the email traffic, with more than 7 million emails per month. A very noticeable figure is the percentage of spam email (80-90%). There is a noticeable difference between the number of (virtual) desktops and the number of users (Stormen, 2013): there are more Windows 7 thin clients than staff members because many desktops are used for acceptation purposes, some users have more than one role, and there are non-active users. Non-active users are cleared several times per year (Schets & Desabandu, 2014). Statistics Netherlands currently has app. 2,200 users of its ICT infrastructure.

Datacentres and network

Geographically, the users are spread over two office locations, in The Hague and Heerlen, as shown in Figure 16. The data storage, system transactions and systems security are supported by the backbone of the ICT infrastructure, the data centre in Almere, which is maintained by Statistics Netherlands itself. A back-up data centre is available at Oude Meer, which is connected only to the data centre in Almere.

Figure 16: Overview Statistics Netherlands location of offices and data centres (Stormen, 2013)


The network within each office location is organized as a LAN (see Figure 17). The connection within the LAN has a capacity of 100 MB/s, which is considered the weakest link within the network. The data centre in Almere is connected with the LANs in Heerlen and The Hague via a WAN (Wide Area Network), and the two LANs are connected to each other as well. The data centre in Almere has only one access point to the internet, via a DMZ (de-militarized zone) unit, to protect the IT infrastructure from network attacks. The WAN network between the office locations and the data centre has 2 Ethernet connections of 1 Gb/s (Figure 16 and Figure 18), whereas the servers within the data centre that support the (virtual) desktops are connected to storage pools via an Ethernet connection of 10 GB/s.

Figure 17: Overview Statistics Netherlands network entities (Stormen, 2013)

Virtual infrastructure
The majority of the users work via thin clients, using a VDI (virtual desktop infrastructure). A VDI enables access to a virtualized desktop, which is hosted on a remote service over the Internet (Techopedia, 2014). The choice for thin clients was made because the separate data centre in Almere is located too far from the offices of Statistics Netherlands to provide a workable transaction time over the WAN between fat clients and the data centre. There are 4.000 Windows 7 thin clients provided via approx. 1.000 VMware virtual servers (currently version 5.5; the migration to this version is assumed to have taken place in June 2014) (Stormen, 2013), running on Dell blade servers at the data centre in Almere. Those servers are coupled to so-called “storage pools” (connected with an Ethernet capacity of 10 Gbit/s), which differ in memory capacity (Stormen, 2013; Schets & Desabandu, 2014). Figure 18 shows the hardware resources that are used to support the main IT infrastructure:


Figure 18: hardware resources (Stormen, 2013)

The storage pools are visualized on the right of the schema, showing the storage pool per function:

Storage pool | Capacity | Function
VDI | 122 TB | Supports the virtual desktops
VSI | 90 TB | Supports the virtual servers in the data centre
CIFS (Common Internet File System) | 239 TB | Enables file and service requests; supports storage of e.g. user profiles, as well as the use of network drives G: and H: for data storage
iSCSI (Internet Small Computer System Interface) | 150 TB | iSCSI is a protocol for databases; supports database storage
Table 3: Storage pools and their function

Of these storage pools, only the virtual desktops are not backed up, except for their user profiles, which are stored in the CIFS storage (Schets & Desabandu, 2014).


Virtual desktops vs use of fat clients
The ICT policy of Statistics Netherlands primarily strives to provide a thin client workspace for every staff member. However, ArcGIS Desktop and ArcInfo Workstation appeared to perform poorly on GIS calculations within the implemented virtual infrastructure, leading to the re-introduction of fat clients for the heavy GIS users (Opperdoes, 2014). The use of GIS software via a virtualized desktop has proven to be insufficient for GIS applications that go beyond creating and viewing simple maps, at least for the virtual construction chosen by Statistics Netherlands. A number of light users (the exact number is unknown) work with ArcView 3.1 or ArcGIS Desktop 9.3 via the virtual environment for visualization.

In general, visualization, analysis and geoprocessing of spatial data within a virtualized environment is a viable option. As for ESRI products, ESRI has published test results of an ArcGIS Server 10.1 Enterprise configuration running on VMware vSphere 5.1 virtualization (ESRI/VMWare, 2013). One of the most important findings was that a virtual configuration with fewer machines and more CPU cores provides better performance than a virtual configuration with more machines and fewer CPU cores. The most advantageous virtual configuration showed only a minimal difference from the physical configuration. The described situation is not truly comparable with the situation at Statistics Netherlands: SN has ArcGIS 10.1 Desktop as well as ArcInfo Workstation 10 and VMware 4.1. Additionally, the IT infrastructure and workload in the ESRI/VMWare publication are not comparable with SN. Virtualization of the GIS infrastructure has also been implemented at a comparable organization in the Netherlands (as will be described in detail in chapter 7), the Netherlands Environmental Assessment Agency (PBL), where physical GIS servers are accessed via virtual desktops, with satisfactory results (Desabandu, Put, & Spoon, 2014). Exploring viable virtualization architectures that would meet the needs of the spatial team would require a much larger research scope. It is, however, worthwhile for the spatial team to consider the current virtual environment at SN and to find out what is necessary to make the virtualization technique work for geoprocessing. Several possibilities could emerge:

 A different virtualization software package or a different version.
 A change in resources: ArcGIS servers (physical or virtual) with more capacity.
 A change in the server infrastructure, e.g. the use of a physical license server and ArcGIS desktop server instead of a virtual one. This would be comparable to the situation at the PBL.

3.3.2.4 Fat clients properties
Every GIS specialist has been provided with a fat client. At first, 32-bit fat clients running the Windows XP operating system were used, with the following properties:

Operating system Windows XP

Hardware provider Fujitsu

Windows performance index Not available

System type 32 bit

RAM 3 GB

Processor type Intel® Core™ 2 Duo CPU

C drive Physical C drive

Version ArcGIS desktop 9.3.1n (1 test machine also with 10.0)

Version ArcInfo Workstation 9.0 (1 test machine also with WS 10.0)

Table 4: Properties Windows XP fat client


After the migration to Windows 7 (project PUMPS), most of the fat clients were replaced by machines with the following properties:

Operating system Windows 7 Enterprise SP 1

Hardware provider Fujitsu

Windows performance index 5.2

System type 64 bit

RAM 8 GB (7,85 GB available)

External memory 2 TB

Processor type Intel® Core™ i5-3570 CPU @ 3.40 GHz (4 cores)

C drive Part of the C drive (roaming profile and My Documents) is redirected to the data centre

Version ArcGIS desktop 10.1

Version ArcInfo Workstation 10.0

Table 5: Properties Windows 7 fat client

These fat clients comply with the system requirements for ArcGIS, but do not belong to the category of high-performance hardware. Additionally, the hard disk is a mechanical hard disk (HDD), although SSDs (solid state disks), which are known to provide much better performance, are available on the market at much higher cost. The fat clients provided with Windows 7 nevertheless have hardware properties that should lead to higher performance, e.g. the move from a 32-bit to a 64-bit processor. 64-bit processors are expected to process faster than 32-bit ones because they have more cores (dual core, quad core, and six core for home computing) (Computerhope, 2015) and because they support a higher amount of memory (RAM). Not all software is yet programmed for 64-bit: since ArcGIS Desktop is designed as a 32-bit application, it can be expected that ArcInfo Workstation and ArcGIS Desktop will not profit from the 64-bit processor.

It is important to note that the account settings are configured to the network for the roaming profile and My Documents, whereas the Temp folders are placed on the local drive. The disadvantage of these settings is that the ESRI settings can become corrupt, and the My Documents folder does not provide enough space for the immediate storage of data in the default file geodatabase (default.gdb). The default file geodatabase is the home location of the spatial content of a map document. It is synchronized with the Current Workspace of the Geoprocessing Environments (ESRI, 2012). The location of the default file geodatabase can be adapted, which was also recommended after a “health check” by ESRI NL (Iparraguirre, 2014).
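As an illustration, such a redirection of the geoprocessing workspace can be scripted. The sketch below is a minimal example; the local folder C:\LocalData is a hypothetical path, while the arcpy environment settings themselves are standard:

import arcpy

# Hypothetical local location for the working file geodatabase,
# instead of the default.gdb under the network-redirected My Documents.
local_folder = r"C:\LocalData"
local_gdb = local_folder + r"\work.gdb"

if not arcpy.Exists(local_gdb):
    arcpy.CreateFileGDB_management(local_folder, "work.gdb")

# Point the current and scratch workspaces at the local file geodatabase.
arcpy.env.workspace = local_gdb
arcpy.env.scratchWorkspace = local_gdb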

3.3.2.5 Support of the IT department
Discussions and interviews with the heavy GIS users of the team show that the cooperation with the IT department is not always easy, partly because of a lack of time and manpower on both sides, but also because of the specific challenges of GIS processes, which require special hardware resources and specialist knowledge of optimization of (geo)processing of spatial data. One staff member of Spatial Statistics is tasked with coordinating ICT requests towards the IT department, whereas the IT department provides one or two staff members who have some extra knowledge of GIS software and are tasked with carrying out installations and fixing minor errors.

The hardware resources have partially been provided with the fat clients, but the specialist knowledge is not present within the IT department. Additionally, the focus of the IT department is on IT security and on standardized, efficient maintenance of the infrastructure with as little custom infrastructure as possible. This leads to a higher dependency on the software-delivering party, in this case ESRI. Specific SLAs for non-functional requirements such as performance, stability, quality and availability have not been agreed upon with the IT department. The services the IT department provides are limited to license management and solving small hiccups, and it is difficult for the IT department to provide figures on the exact number of users of the diverse GIS applications. The licenses (24 in total) that are in use at the moment are (Desabandu, 2014):

 8 advanced concurrent - ArcGIS Desktop 10.1/ArcInfo Workstation 10 mainly for fat client users

 4 basic concurrent

 12 basic single use

An upgrade to ArcGIS Desktop 10.2 is planned in the near future, having started with the virtual desktop users in April 2015. Spatial Statistics has decided to have ESRI conduct a health check on the current implementation of the ESRI products within the IT infrastructure. The health check showed that some components can be configured differently, e.g. the licences and the location of the user profile and the default geodatabase. The hardware itself has been evaluated as sufficient. The health check report also revealed notable results for a number of tools, such as UPDATE CURSORS, JOIN ITEM/JOIN FIELD and FEATURE CLASS TO COVERAGE. JOIN ITEM with ArcInfo Workstation/AML showed significantly better results than JOIN FIELD with ArcGIS Desktop/Python: an execution time of 5 seconds (AML) versus 331 seconds (Python). This has been communicated to the ESRI back office. The UPDATE CURSOR shows better results in Desktop/Python, and the combination Windows XP/ArcInfo Workstation performs better than Windows 7/ArcInfo Workstation. Finally, the conversion of large datasets to coverage often results in fatal errors, such as “Too many files” or “invalid topology” (Iparraguirre, 2014). The recommendations of the health check mainly address changes in the configuration and installation of the different software components:

 Install the home folder and user settings for ESRI products locally on the fat client, not in the My Documents folder, which is situated on the network
 Set the port number and host name for the license manager
 Minimize the impact of the use of bubble networks for software installation: install PyScripter and ArcView locally or expand the tasks of the bubble

These recommendations have not yet been implemented and are not part of the benchmark.

New developments in IT resources
During autumn 2014, a project to migrate to the MS Windows 7 operating system was finished, leading to an apparent loss in system performance, as perceived by users at different departments. The reasons for these observations are currently being researched by the IT department, but no clear conclusions have been reached so far. Probable causes could be growth in the number of users and new applications running under Windows 7 (Schets & Desabandu, 2014). An upgrade of the operating system can also be combined with an application upgrade, as is the case at the spatial team. Apart from the IT infrastructure that supports the primary processes of the organization, an “innovation laboratory” has been created to facilitate proofs of concept of innovative ideas, without dependency on the regular IT infrastructure. The tools that are available at the innovation lab are partially hardware facilities, such as laptops and big data computers with the following properties:


Operating system Windows 7 professional

Hardware provider/type Fujitsu Celsius M720

Windows performance index 5.2

System type 64 bit

Disk 4x 256 GB Solid State disks

RAM 64 GB


Processor type Xeon E5-2640 2.5 GHz 15MB Turbo Boost (16 cores)

Version ArcGIS desktop 10.2, ArcGIS Pro

Table 6: Specification big data computer

One of the big data computers has internet access; the other has access only to the closed network. Several laptops are available as well for testing purposes. The infrastructure is completely separated from the SN network, and maintenance is organized by the staff of the laboratory themselves. The facility is clearly meant for proofs of concept and not for production: using the facilities directly for production would mean that the team using them would have to pay for them, resulting in very high production costs for the statistical product. Due to the fast developments in hardware and the growth of data volume, an upgrade of the innovation lab facilities is being considered. Possible solutions considered by SN are decentralized hardware per team/department, extension of the innovation lab resources, or cooperation with the University of Amsterdam on the use of a supercomputer at data centre SARA. Other cloud solutions could be the Amazon cloud5 or a big data facility for statistical agencies in Ireland (Desabandu & Emons, 2014). Recently, Spatial Statistics has started to conduct experiments with network analysis in the innovation lab on the big data computer, using ArcGIS Pro.

3.4 Summary: performance bottlenecks
Based on the information in sections 3.1 through 3.3 and chapter 1, several performance bottlenecks can be identified:

3.4.1 Diversity of users, applications and performance needs
Researching technical documentation and speaking to the members of team Spatial Statistics revealed different types of users within and outside the team. This means that the complete geo-ict workload is difficult to assess. Spatial Statistics has a small number of heavy users, skilled in ArcInfo Workstation and AML and partially in ArcGIS Desktop and Python. Parallel to these users, their colleagues at the department of Process Development and Methodology use R Spatial for statistical analysis on spatial data. There is no real exchange of knowledge between Spatial Statistics and Process Development and Methodology, although this would be very valuable for both sides: Process Development and Methodology could share their experiences with spatio-temporal data and the application of R Spatial, and both teams could exchange lessons learned on functionality and performance of the software used. The light users of Statistics Netherlands apply different software via the virtual desktop, such as ArcView, ArcGIS Desktop and PX Map. This situation makes it difficult to support and optimize the use of spatial data from the point of view of the users and the IT department.

5 Cloud facility of Amazon.com that provides virtual computing services

3.4.2 User experiences with ArcGIS Desktop
A number of users have experienced instabilities with the application, for example crashing of the application after running a tool (no specific tool). The role of the user profile is not really clear: creating a new user folder under %appdata% helped. For some users, the desktop application showed a much slower execution time for some tools, e.g. the (administrative) JOIN, but also UNION. A lot of methods to optimize performance and forestall instabilities have been applied, such as dataset partitioning (on attribute or record level), local processing and maintaining the AML/Workstation tools, but there is no sustainable strategy available.

3.4.3 Lack of means of IT department and Spatial Statistics for application management
Only 2 staff members of the IT department perform tasks such as licensing and troubleshooting for the GIS users at Spatial Statistics, or even Statistics Netherlands as a whole. At the same time, the production processes and available resources at the spatial team leave very little time to think fundamentally about non-functional requirements such as performance, to set up SLAs (Service Level Agreements) and to look for means to fulfil the non-functional requirements together with the IT department.

The following chapter will present a number of performance factors and analyse their suitability for the benchmark.


Chapter 4: Performance Factors In the previous chapter we have established a view on organizational and technical bottlenecks of geoprocessing performance by analysing the workload and resources of team Spatial Statistics and Statistics Netherlands. The goal of this chapter is to establish a methodological basis which can be used to develop a benchmark that can provide insight into well performing alternatives within ArcGIS desktop compared to ArcInfo Workstation.

The definition of performance optimization is related to the definition of performance. A viable definition is “the processes of making something, especially a computer system, work as effectively as possible” (Dictionary). Figure 19 shows an interesting schema that provides insight into what Oracle considers the factors that influence database performance, and the cost implied by adjusting each factor. In the figure, the factor with the highest impact on performance is also the one whose implementation comes at the highest cost.

Figure 19: Impact and cost of performance improvement implementation (Godfrind, 2008)

The figure does not state whether the impact is short- or long-term: changing the hardware configuration and altering the physical specifications of the database can have a high impact on performance; however, concerning long-term performance and scalability, it is more sustainable to look at the application and logical database design. But that would imply reviewing business goals and the current information architecture. For benchmark design, the logical database design and the application design are the most difficult and most “costly” factors to apply.

4.1 Amdahl’s Law
To be able to estimate the probable optimization that can be achieved with a factor, some basic principles of performance optimization have to be taken into account. A very important principle is the so-called “Amdahl’s Law” (Hennessy & Patterson, 2007), which quantifies the contribution of a certain performance improvement to the overall performance of an operation. Speedup is calculated as the ratio of the execution time of the “slower” version to the execution time of the “faster” version (Hennessy & Patterson, 2007):

Speedup = (execution time for the entire task without using the enhancement) / (execution time for the entire task with the enhancement)


To calculate the improvement of an enhancement (a certain factor such as parallel processing or an index), one needs to estimate the part of the total operation that will be affected by the enhancement. If 20% of the operation can run 10 times faster, the affected part contributes 0.2/10, while the remaining 80% (0.8) is unaffected. The overall speedup is 1 divided by the sum of both parts: 1/(0.8 + (0.2/10)) ≈ 1.22. An important lesson from this model is that big overall performance improvements can only be achieved if they affect a large part of the operation.
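The calculation can be expressed as a small helper function; a minimal sketch in which the function name and example values are illustrative only:

def amdahl_speedup(affected_fraction, factor):
    """Overall speedup when a fraction of the task runs `factor` times faster."""
    unaffected = 1.0 - affected_fraction
    return 1.0 / (unaffected + affected_fraction / factor)

print(amdahl_speedup(0.2, 10))  # ~1.22: modest overall gain despite a 10x local speedup
print(amdahl_speedup(0.9, 10))  # ~5.26: large gain when most of the operation is affected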

4.2 Resource configuration
The available resources at Statistics Netherlands are a combination of CPU, RAM and disk specifications. ESRI provides recommended specifications for its software products, bundled in Table 7. The table shows that the hardware demand rises with every new upgrade.

ArcInfo Workstation 10
 Processor: Intel Pentium 4, Intel Core Duo, or Xeon processors; SSE2 minimum; 2.2 GHz minimum; hyper-threading (HHT) or multi-core recommended
 RAM: 2 GB minimum
 OS: Windows XP, Windows Vista, Windows 7 (32 and 64 bit)
 Disk space: 820 MB; an additional 100 MB is required for all samples

ArcGIS Desktop 10.1
 Processor: Intel Pentium 4, Intel Core Duo, or Xeon processors; SSE2 minimum; 2.2 GHz minimum; hyper-threading (HHT) or multi-core recommended
 RAM: 2 GB minimum
 OS: Windows XP, Windows Vista, Windows 7, Windows 8 (32 and 64 bit)
 Disk space: 2.4 GB; in addition, up to 50 MB of disk space may be needed in the Windows System directory (typically C:\Windows\System32)

ArcGIS Pro
 Processor: Intel Pentium 4/Intel Core Duo/Intel Xeon processor; minimum: hyper-threaded dual core; recommended: quad core; optimal: 2x hyper-threaded hexa core
 RAM: minimum 4 GB; recommended 8 GB; optimal 16 GB
 OS: Windows 7, 8 (64 bit)
 Disk space: minimum 4 GB; recommended 6 GB or higher; 1.50 GB of available disk space on the installation drive

Table 7: Specifications ARCGIS, ArcInfo Workstation and ArcGIS Pro

The available infrastructure, the fat client as well as the big data computer, have been compared to these specifications (3.3.2.4 Fat clients properties). The available resources at SN, concerning RAM, CPU and disk type, seem to be sufficient.


4.3 Dataset size, scalability
Partitioning the datasets can be very effective for performance improvement, as it reduces the time spent on scanning the tables, backing up the tables and (re)building indexes. It is also often a preparatory step for multiprocessing. Splitting tables is often used: a national census dataset could, for example, be split into administrative units, which is called horizontal partitioning, whereas reducing the number of columns per table would be vertical partitioning. Besides the advantages of partitioning, one should, however, also consider the costs of pre-processing and post-processing the data. It also requires thorough knowledge of the data involved and of the type of geoprocessing operation.

4.3.1 Horizontal
Horizontal partitioning splits the table into smaller units, often called partitions. A so-called partition key, which is a certain attribute or a combination of attributes, is used as the partitioning criterion. Reading rows from disk is a relatively slow process; therefore, reducing the number of rows reduces the I/O time (Alsultanny, 2010), assuming that the data are clustered in such a way (related to their location) that they can be retrieved easily.
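As an illustration, horizontal partitioning by attribute can be scripted with arcpy. The sketch below is a minimal example under assumed names: the dataset path and the partition key “PROV_CODE” are hypothetical.

import arcpy

# Hypothetical input: split a national dataset into one partition per province.
src = r"C:\LocalData\work.gdb\neighbourhoods"
out_ws = r"C:\LocalData\work.gdb"

# Collect the distinct values of the assumed partition key.
codes = {row[0] for row in arcpy.da.SearchCursor(src, ["PROV_CODE"])}

# Write one feature class per partition key value (horizontal partitioning).
for code in codes:
    arcpy.Select_analysis(src, "{0}\\part_{1}".format(out_ws, code),
                          where_clause="PROV_CODE = '{0}'".format(code))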

4.3.2 Vertical
Some tables contain many columns with attributes that are not always needed by the users, which also slows down the I/O process. By reducing the number of attributes, the average row length is minimized and therefore less I/O activity is needed (Alsultanny, 2010). As with horizontal partitioning, it should be considered that extra cost can be involved in pre- and post-processing. Vertical partitioning divides a table into multiple tables that contain fewer columns. The two types of vertical partitioning are normalization and row splitting:

 Normalization is the standard database process of removing redundant columns from a table and storing them in secondary tables that are linked to the primary table
 Row splitting divides the original table into tables with fewer columns. Each logical row in a split table matches the same logical row in the other tables, as identified by a UNIQUE KEY column that is identical in all of the partitioned tables.

4.3.3 Algorithm and scalability
The algorithm of a geoprocessing operation, together with the data structure, plays an important role in the performance of that operation. It determines which steps have to be taken to perform a spatial calculation. In many literature examples, for example in Zhao et al. (2012), the algorithm is built by the researchers themselves and is thus more transparent. However, the low-level underlying algorithms of the different tools programmed by ESRI are still a black box. A comparison with algorithms for the same tools in other software packages could not be included in the scope of this research.

Therefore, only the “behaviour” of such an algorithm can be analysed, with the use of a series of increasing datasets for each of the tools. The role of the algorithm is assessed as vital for performance in scientific literature but also on developer websites such as gis.stackexchange. A very usable comment from a user (Huber, 2011): “By far the most dramatic improvement in long-running GIS processes is made by improving the algorithm. In many, many cases, if your computation is taking noticeably longer than the time required to read all inputs and write all outputs, chances are you are using an inefficient algorithm.” This is not applicable to all computations; investigating the actual performance problem is necessary first.

4.4 Parallel processing
Data partitioning, as covered in the previous section, is one of the preparatory steps for parallel processing. The data are not only partitioned horizontally or vertically: Qian (1997), for example, uses a spatial partition of the data (based on a quadtree) to subdivide the data and to process them via a computer grid. A number of publications that deal with processing of large spatial datasets investigate the application of parallel processing as an effective method to improve performance. Zhao et al. (2012) have developed an algorithm for environmental models that calculate statistics of raster layers. Parallel processing is implemented in basically 2 different ways:

1) Executing a computation process, divided in sequences or divided in data, on more than one CPU or CPU core at the same time
2) Executing a computation process, divided in sequences or divided in data, on a network of computers (computer grid)

Before trying parallel processing, a number of aspects have to be considered. Firstly, the ICT configuration has to be suitable for parallel processing. Secondly, the cost of parallel processing is high in terms of required knowledge, pre-processing, post-processing and resources. During parallel processing, the processing steps are executed by more than 1 CPU or by computer nodes within a computing cluster, as has been researched in various studies, e.g. Abdelguerfi et al. (2005). Using too many parallel units can also lead to more system overhead.

Since ArcGIS 10.1, some geoprocessing modes have been introduced that should help with processing large amounts of data. Firstly, 64-bit background processing has been enabled, providing the only possibility to use 64-bit advantages, since ArcGIS itself is designed in 32-bit. Secondly, the parallel processing environment parameter has been introduced, which is enabled for a number of tools. According to ESRI, multiprocessing is not recommendable for overlay tools because of the Topology Engine, the subsystem of ArcGIS that determines the topological relationships of the features: it claims ca. 60% of the estimated available RAM, which is shared by the CPU cores (Pardy & Hartling, 2013). This clearly excludes UNION and INTERSECT, as they are overlay tools; DISSOLVE and NEAR are more suitable. The user can assign a percentage to the parallel processing environment, which is related to the number of cores: 100% means that all (4) cores will be used. ESRI recommends using a value higher than 100% (more processes than cores) if the processes are I/O bound (ESRI, 2014b).
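In a Python script, this environment setting looks as follows; a minimal sketch in which the paths are hypothetical, while the dissolve field “FID_nl100mgrid” is taken from the benchmark scenarios in chapter 5:

import arcpy

# ArcGIS 10.1+ environment setting; values above "100%" start more
# processes than cores, which ESRI suggests for I/O-bound workloads.
arcpy.env.parallelProcessingFactor = "100%"

# DISSOLVE is one of the tools that honours this setting; overlay tools such
# as UNION and INTERSECT are constrained by the Topology Engine's memory claim.
arcpy.Dissolve_management(r"C:\LocalData\work.gdb\nl25m_100m_union",
                          r"C:\LocalData\work.gdb\nl25m_100m_dissolved",
                          dissolve_field="FID_nl100mgrid")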

4.5 Compression and compaction

4.5.1 Compression
Compression of databases is often used to save storage space and network bandwidth (Alsultanny, 2010). ESRI has implemented a compression technique for vector file geodatabase feature classes and tables which makes the data read-only. It is not necessary to decompress a dataset before access, because the data are compressed in a direct-access format. During some processes compression can provide a performance improvement, but not all; in some cases performance is even slightly worse (ESRI, 2013b). Two compression types can be implemented: lossless or non-lossless (lossy). In the first option no information is lost, including floating point values; in the second option the data can lose precision, as floating point values can be altered. The lossy option is therefore not viable for Statistics Netherlands, although its compression rate is higher (lossy compression stands for up to 20% more compression). The compression ratio also depends on the type of input data. First, the number of vertices of the features is important: points and lines with few vertices compress better than a polygon dataset or a line dataset with lines that contain many vertices. Secondly, attribute field types influence the compression ratio: text, integer and date fields lead to better compression than double or floating field types (ESRI, 2013b).

4.5.2 Compaction
File geodatabases are file based, stored as binary files within a workspace on a disk drive. Editing these data results in fragmentation of the data, which decreases overall performance. Compaction is used to rearrange and clean up the files within a file geodatabase that has been updated frequently (ESRI, 2013a). This also reduces the time needed to scan the tables, which improves performance. ESRI advises to perform compaction monthly for file (or personal) geodatabases that are edited frequently, and always after a large-scale change. The size can be reduced significantly after compaction: about 50%. Additionally, disk defragmentation should be performed regularly for overall disk clean-up (ESRI, 2013a). Disk defragmentation of the local drive is probably not that necessary for the Spatial Statistics users, because the data are only stored there temporarily to perform operations.
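Both operations are exposed as standard geoprocessing tools and can be scripted; a minimal sketch with hypothetical paths:

import arcpy

gdb = r"C:\LocalData\work.gdb"   # hypothetical file geodatabase
fc = gdb + r"\bbg2008"

# Compress a feature class into the read-only, direct-access format.
arcpy.CompressFileGeodatabaseData_management(fc)

# ... run read-only geoprocessing on the compressed data ...

# Uncompress again before editing; the normal storage (and index) is re-established.
arcpy.UncompressFileGeodatabaseData_management(fc)

# Compact the file geodatabase to defragment it after frequent edits.
arcpy.Compact_management(gdb)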

4.6 Spatial Access Methods
Spatial access methods can be described as a set of methods that ensure efficient storage, retrieval and display of 2D or 3D spatial data, as well as geoprocessing. The two most important aspects of spatial access methods are spatial indexing and spatial clustering. Whereas spatial indexing creates an index file to quickly look up data and locate them on disk, clustering stores spatial data on disk directly related to their spatial proximity, often based on methods like space-filling curves. Spatial access methods are often the basis for geoprocessing algorithms, e.g. spatial selections or overlay operations like UNION or INTERSECT. Spatial indexing is applied at Statistics Netherlands using the default spatial grid index settings for feature classes. Clustering, however, has not been applied, as the possibilities in ArcGIS Desktop are limited: clustering techniques like space-filling curves are only applied partially, on table level (see 4.6.2 Clustering/Sorting).

4.6.1 Spatial indexing
There are different types of spatial indexes; which one is applied varies per GIS software. Examples are the grid index, R-tree, KD-tree and Field Tree. Spatial indexes are often subdivided into space-driven and data-driven indexes. Space-driven indexes like the quadtree (also used in ArcGIS for loading maps and for overlay tools with large datasets) or the grid are used to index the MBRs (minimum bounding rectangles) of objects, not the actual geometries. The space is subdivided until the number of rectangles overlapping each quadrant is less than the disk page capacity. Each leaf is associated with a disk page that stores the entries; a rectangle appears in all leaf quadrants that it overlaps. Data-driven indexes like the R-tree are based on the subdivision of the set of objects, not of the embedding space: the subdivision adapts to the object distribution in the space. Spatial database systems such as PostGIS and Oracle Spatial use the R-tree. In some cases, the R-tree has a performance advantage over the grid index (Matty, 2012): it is suitable for more dimensions (3D, 4D), uses less storage, is easier to maintain and is faster for operations like the Nearest Neighbour. As the data-driven indexes are not implemented in ArcGIS Desktop, this chapter will not elaborate further on these access methods for the benchmark. It is, however, important to consider that choosing certain software also means choosing the spatial access methods that are inherent to that software.

4.6.1.1 Grid Index
This spatial index is built by applying a grid to the data in the spatial column. In ArcGIS it is possible to assign up to 3 grids at different levels, each with a different resolution (cell size). Depending on the size of the features, one, two or all three levels can be applied. Figure 20 shows this process: by applying the lowest level to the map with polygon nr. 102, the polygon is entered 7 times into the spatial index; at the next level, only 2 times.


Figure 20: Grid index (ESRI, 2014c)

The selection of the grid size has to be considered very carefully: a size which is too small will result in overhead of index table entries, whereas a size which is too large will retrieve too many features (Harley & Gellerstedt). The grid index has been assessed as inflexible by Oosterom (1999), as it does not cope well with very irregular distributions of data or with polygons that vary greatly in size. In ArcGIS, the spatial index is computed and maintained automatically after creation of a feature class and after updates. The spatial index can only be modified for enterprise geodatabases or databases that use the ST_Geometry storage type in Oracle or Geometry storage in SQL Server. For feature classes in file geodatabases, which are used at Spatial Statistics, the spatial index can only be deleted and recreated (ESRI, 2013g).

Compressed datasets use a different type of indexing, which applies only while in compressed mode; after uncompressing, the “normal” index is re-established. Manual updating of the index is only advised by ESRI after updating a dataset with a large number of features that differ significantly in size from the other features (ESRI, 2013g). It is possible to assign custom grid sizes, although the default grid size is supposed to be the calculated optimum.

For setting the spatial index grid size in ArcGIS, several guidelines have been provided by ESRI (ESRI, 2014c), although not all can be honoured because they are largely aimed at database users (with ArcSDE):

 one grid level is usually enough
 one level with large grid sizes is usually enough for point datasets
 try first one level with a grid size three times as large as the average feature extent size (if the application window is unknown, which is the case in our scenario)
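For a file geodatabase feature class, applying such a custom grid size means deleting and recreating the index; a minimal sketch, in which the path and the grid size of 300 are illustrative assumptions:

import arcpy

fc = r"C:\LocalData\work.gdb\bbg2008"   # hypothetical feature class

# File geodatabase spatial indexes cannot be modified in place:
# remove the existing index and rebuild it with an explicit grid size.
arcpy.RemoveSpatialIndex_management(fc)

# Guideline: one level, roughly three times the average feature extent.
arcpy.AddSpatialIndex_management(fc, spatial_grid_1=300)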

4.6.1.2 Quad Tree
This method is applied by overlay tools that use large datasets which would not fit into the available memory. It starts with a bounding box that covers the complete dataset. The features are read before being sent to the Topology Engine; the curves are densified and flagged for recreation, followed by the allocation of a large part of memory for this process. Then the subdivision starts by dividing the bounding box into 4 quadrants. All features are read within each quadrant and processed; where necessary, quadrants are subdivided further until the number of rectangles that overlap each quadrant is less than the disk page capacity. This is also called subdivision logic. A rectangle can be stored in more than 1 quadrant. Esri uses the quadtree algorithm to perform tiling, a process that divides overlay geoprocessing into smaller portions to make sure that the data fit in memory. This method is not suitable for parallel processing if features overlap a quadrant.


Figure 21: Tiled overlay processing (Pardy & Hartling, 2013 )

The tools that apply this method are e.g. UNION, CLIP, INTERSECT, IDENTITY, UPDATE, DISSOLVE and FEATURE TO LINE. This approach will not work with large features containing very many vertices (many millions of vertices) or if another process is running (ESRI, 2012).

4.6.2 Clustering/Sorting
Clustering is the process of grouping spatial data in such a way that spatially related features are stored close together in memory, which leads to faster retrieval of those data and less need to fetch the spatial objects from different, scattered memory locations. Space-filling curves were devised as a method to map 2D or 3D data to a 1D memory space (Oosterom, 1999; Chen & Chang, 2005). They are organized in tiles and follow a certain order within those tiles. An important characteristic of a space-filling curve is that it passes through every point in a certain space exactly once, without crossing itself (Chen & Chang, 2005). The space-filling curves most mentioned in literature are the Peano and the Hilbert curve, because they show the smallest number of jumps. The Hilbert curve is more advantageous in that regard, because the Peano curve shows a large jump from the node in the upper left quadrant towards the node in the lower right quadrant.

Figure 22: Row prime curve (Oosterom, 1999)

Figure 23: Hilbert curve (Oosterom, 1999)

Figure 24: Peano curve (Oosterom, 1999)

A table that is clustered spatially can reduce I/O cost and the cost of using the spatial index. If the spatial index still has to retrieve the table rows and then fetch them from different memory locations, a full table scan can perform better than using the index (ESRI, 2007). ArcGIS does not employ a large variety of space-filling curves: they are implemented in the SORT tool, which offers the row prime space-filling curves (UL, UR, LL and LR, depending on the starting point of the curve) and the Peano curve.
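In a script, spatial sorting looks as follows; a minimal sketch with hypothetical paths (the method keywords PEANO and UL/UR/LL/LR are the options offered by the SORT tool):

import arcpy

src = r"C:\LocalData\work.gdb\nwb"         # hypothetical input
dst = r"C:\LocalData\work.gdb\nwb_peano"   # spatially sorted copy

# Rewrite the feature class with its rows clustered along a Peano curve,
# so that spatially adjacent features are also stored adjacently.
arcpy.Sort_management(src, dst,
                      sort_field=[["Shape", "ASCENDING"]],
                      spatial_sort_method="PEANO")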

The implementation of those methods varies a lot: Oosterom (1999) states that not many access methods have been implemented in spatial database systems. Saalfeld (1998) underlines this statement by arguing that many of the existing techniques for analysing spatial data are constrained by the limitations of hardware and software. It is important to note that these publications are relatively old: hardware possibilities and software solutions for geoprocessing have certainly improved since.

4.7 Data format
The data format of spatial features is often inherent to the GIS software that is implemented, although open standards have been developed over the years. This makes it very difficult to compare the performance of operations across different GIS software. Next to open standard formats like GML, JSON or KML, a number of proprietary formats exist from vendors like e.g. ESRI, MapInfo or Geomedia. As the focus will be on ESRI software, the difference between the coverage and the file geodatabase format (the data formats currently used at Statistics Netherlands) will be analysed and related to possible influence on performance. Both formats are file based, but they differ in the handling of topology as well as attribute data.

In the coverage format, polygons are defined as arcs, which represent an ordered set of edges; this makes it easier to store large polygons. The storage is efficient as well: an edge can be used for more than one polygon. On the other hand, the “on-the-fly” assembly of edges into polygons can also slow down some geoprocessing operations, e.g. those involving multi-part polygons (regions) and multi-part line features (routes). Moreover, editing a coverage implies rebuilding its topology (Esri, 2004). Storage of attribute data in INFO tables is also efficient for “administrative” processing: only the INFO tables have to be read and written to memory.

Topological data formats have advantages and disadvantages. In general, explicit topology leads to slower performance in rendering and displaying spatial data, because the vertices have to be read from different tables. Moreover, maintenance of the topology after updating a file can be time consuming. A large advantage of explicit topology is the execution of certain geoprocessing overlay operations, because the explicit topology reduces storage and saves substantial computation time (Theobald, 2001). This is also the case for an aggregation tool like DISSOLVE: the principles of contiguity and area definition (the storage of adjacency) enable the application to search (indexing and sorting necessary), retrieve and delete the common arcs of a group of polygons. Because of its topological structure and its legacy status, the coverage format also has certain limits, for example (ESRI, 2010b):

 Max. 500 coordinates per arc (after the 500th coordinate, a new arc is created)
 Max. 360 arcs per node
 Max. 10.000 INFO tables per workspace
 Max. 4,096 bytes or characters per record within a feature attribute table
 Processing limitations can occur during cleaning and building the topology and during overlay operations

4.8 Application Design
The factor of application design is the least transparent, because many different aspects within the development of an application can influence performance: the algorithm, interoperability with operating systems, interoperability with big data storage applications, 64-bit design, use of hardware resources and the implementation of spatial access methods (related to the algorithm).

Some of the implemented methods are known, for example the spatial access methods that ArcGIS applies (grid index, row prime and Peano sorting, quadtree partitioning of data during overlay processing), but the low-level programming architecture is largely unknown, which makes it impossible to use application design itself as a factor within the benchmark. In some examples, like Sorokine et al. (2012), disappointing results of a certain algorithm in ArcGIS Desktop 9.x (in comparison to ArcView) led to a low-level implementation of the algorithm in C++ and Java, with satisfying results for the low-level application.

The application design is more of a framework for the implementation of the factors (which performance factors can be implemented in ArcGIS), whereas the results of the benchmark will hopefully provide valuable insight into whether the application design provides enough possibilities to keep performance at an acceptable level. Within the area of low-level applications, a number of studies have analysed the impact of implementing geoprocessing algorithms directly in programming languages such as C++ or Java, which results in significant performance improvements (Sorokine et al., 2012).

4.9 Network
In chapter 1 it became clear that the heavy calculations are executed locally on the fat clients: the executable is on the local drive, as are the data, which have been copied to that location. What remains on the network are the script, the user profile and the license server. The network is therefore not expected to have a significant impact, although it is still advisable to test a script and one tool to quantify the role of the network within performance. The fact remains that the fat clients have to use a connection via the WAN to the data centre, which could lead to delays.

4.10 Use of Workspace
Within ESRI software, the term “workspace” indicates a “container for geographic data” (ESRI GIS Dictionary, 2015), which can be a file geodatabase, a feature dataset, a folder or an ArcInfo workspace. ESRI has indicated that storing the input and output datasets in different workspaces (directories) could have an impact on geoprocessing performance. This is not documented, but it could be a “noise” factor, which has to be excluded by comparing the performance of geoprocessing with input and output data in different workspaces against the use of a single workspace.
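The two situations to compare can be expressed as follows; a minimal sketch with hypothetical paths and dataset names:

import arcpy

# Variant 1: input and output in the same workspace (file geodatabase).
same_ws = r"C:\LocalData\same.gdb"
arcpy.Union_analysis([same_ws + r"\wb2012", same_ws + r"\wb2008"],
                     same_ws + r"\union_same")

# Variant 2: inputs and output spread over separate workspaces.
in_ws = r"C:\LocalData\in.gdb"
out_ws = r"C:\LocalData\out.gdb"
arcpy.Union_analysis([in_ws + r"\wb2012", in_ws + r"\wb2008"],
                     out_ws + r"\union_diff")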

4.11 Discussion
A lot of information has been collected on benchmarking methods, performance factors and the workload and resources of Spatial Statistics. Not all information could be retrieved. For research question 1, some information was not available, e.g. hardware resource usage: these metrics have not been collected by the staff in their working process. More importantly, the current business objectives of Spatial Statistics are clear, but not the objectives over a longer term, e.g. 5-10 years. Therefore, the suitability of an ArcGIS Desktop or ArcGIS Pro infrastructure can be evaluated for current geoprocessing processes only (or rather, for a part of those processes).

The actual benchmark has to be built on the results of chapters 2, 3 and 4. However, not all factors could be covered in depth, because the range of the factors is very broad. Often, one factor alone, such as spatial indexing, or even one type of index, is covered extensively in related work such as Arge et al. (2002), Oosterom (1999) and Mellor-Crummey (2001). Contrary to part of the related work researched for this benchmark, the factors could not be covered in depth in this report, as in-depth research per factor would result in several separate research projects. Moreover, many examples are not wholly applicable to real-life organizations because of the restraints that are inherent in an organization's infrastructure, policy and capacity. For example, there is a modest number of spatial database benchmarks or comparisons: Tijssen (2012), Ray (2011), Paton et al. (2000), Batcheller et al. (2007) and Matty (2012). However, except for Batcheller et al., none of these benchmarks has included ESRI products such as ArcInfo Workstation and ArcGIS Desktop. The publication of Tijssen et al. included ArcSDE middleware, but with Oracle as the underlying DBMS. The study of Matty (2012) compares Oracle Spatial with Postgres Spatial. Additionally, the related work listed in this section covers spatial DBMS products, which is also not applicable for many organizations, since they mostly use ArcGIS Desktop products and work with file geodatabases. The restriction of the scope to ArcInfo Workstation, ArcGIS Desktop and ArcGIS Pro therefore also excludes a number of factors. Without a full spatial DBMS such as Oracle or Informix, for example, there are no possibilities to fine-tune database performance, e.g. by adapting resources such as the page size or by modifying the spatial index.

The file-based ArcGIS Desktop provides other possibilities to improve performance, which have not been researched in a scientific project before: compacting and compressing of input data, adjusting the spatial index, 64-bit background processing and the new desktop product ArcGIS Pro. Hardware- and software-related factors are strongly interrelated: the review of performance factors in the previous sections has shown that every factor can help to use the available hardware resources in a more efficient way. The question is to find out which factor, or combination of factors, implementable within ArcGIS (Desktop), will equal the current performance in ArcInfo Workstation or even outperform it. What is yet unclear is the factor that makes certain operations in ArcInfo Workstation faster than in Desktop: the data model (coverages) with its explicit topology, or the application architecture, which possibly uses the system resources differently from ArcGIS Desktop. This is still a “black box”, although some assumptions can be made regarding the role of explicitly stored topology.

The first priority therefore is to compare scalability between ArcInfo Workstation and Desktop: the data model as well as the efficiency of the algorithm (strongly related to the data model). The efficiency of the algorithm can be made visible with the help of a controlled, synthetic dataset in different sizes (numbers of records). The synthetic data have to be scaled considerably to provide enough information (e.g. from 10.000 records up to 10.000.000 records).

The second priority is to exclude “noise” that could disturb performance: network interference (although considered a small factor) and the location of the input data in different workspaces. The network is expected to have the least impact, because it is used only for reading the script, writing the logs, and the license server. Possibly the roaming profile (see chapter 3) on the C drive can be of influence. It will not be necessary to test all tools with the network factor.

Finally, a number of optimization factors can be evaluated in ArcGIS 10.1 on the fat client to assess how much optimization is possible with the available resources of Spatial Statistics: for example compression and compaction, sorting and spatial indexing. For indexing, it is not known how the default index is calculated. The default index can be compared with input datasets without a spatial index, or with a new index with assigned grid levels. Partitioning of the data, as well as parallel processing, is not suitable for all workload scenarios. Therefore, partitioning has less priority than compression/compaction, indexing and sorting.

Referring to the schema of Godfrind (2008), the hardware implementation is the easiest, however with the least (long-term) impact. For the situation at Statistics Netherlands, there is a certain degree of bias, because the hardware implementation is strongly tied to the software version. The reverse is true as well: ArcGIS Pro is very promising as a true 64-bit GIS, but it is only installed on a computer with much higher specifications than the “regular” fat client.

This benchmark should provide some guidance for staff outside academic ICT research facilities to evaluate performance within a non- or less-configurable environment (such as the fat clients), with limited ICT support and limited optimization possibilities within the software.


Chapter 5: Benchmark Development for Statistics Netherlands

5.1 Method
It was clear at an early stage of this research project that the use of a macro workload/benchmark, which would test the performance of a complete production process of one of the products of the spatial team, would be very costly in terms of time, transparency (regarding the identification of possible performance bottlenecks) and repeatability. Moreover, a number of products are in a stage of redesign, including their analysis method. Therefore, the “micro benchmark” approach of testing single tools has been chosen, for the tools that have been indicated by Spatial Statistics and that rendered interesting results during baseline tests.

The main goal of the benchmark is to provide information on performance optimization of the current workload within ArcGIS desktop and ArcGIS Pro. The benchmark has to yield reliable results but also reflect the current workload. The reliability is reflected in the following measures:

 Create a micro benchmark containing single geoprocessing tools
 Combine 1 factor and 1 tool only (except for the resource-related factors)
 Validation of the results: performance measurements and output data

The workload is reflected in:

 The use of frequently used real datasets of different sizes
 The use of large synthetic datasets

The list of factors is already very extensive, but only a selection of those factors is covered extensively in literature, and an even smaller selection is applicable in ArcGIS Desktop and ArcGIS Pro.

5.2 Performance factors
The first step of the benchmark is the baseline test, first in ArcInfo Workstation to establish the performance requirements (in execution time), and secondly in ArcGIS Desktop 10.1 to establish the performance, execution time and CPU usage in default mode. Subsequently, the scalability and stability of the algorithms of the chosen geoprocessing tools UNION, INTERSECT, DISSOLVE and NEAR will be tested with synthetic data in ArcGIS 10.1 on the fat client (operating system Windows 7).

As a next step, possible “noise” as a result of network interaction has to be excluded. For this, one or two tools in combination with a large dataset are sufficient to detect possible influence. The following step covers an “exclusional” factor as well: does storage of the input datasets in one file geodatabase (workspace) differ from storage in separate workspaces? To exclude that impact, 1 or 2 scenarios are sufficient as well.

The following three factors, spatial index, sorting and compacting/compressing, are optimization factors: can they optimize the performance of the tools in ArcGIS 10.1? For these, the real datasets will be used, but not for every tool: the sort will be most useful for DISSOLVE or NEAR, whereas the use of a spatial index could be of importance for all geoprocessing tools.

The last two factors will be executed on the big data computer: one scenario to compare the default ArcGIS 10.1 fat client performance with the big data computer (ArcGIS 10.2.2) performance, and one to compare ArcGIS Pro, a 64-bit application, with the results of the previous step (ArcGIS 10.2.2).

These steps are visualised in Figure 25: Performance factors. Each color corresponds with a benchmark step:

Green: Baseline and scalability
Orange: Exclusion of network and workspace influence
Purple: Optimization of performance in ArcGIS 10.1
Blue: Change of hardware/software configuration


baseline: real-life data; fat client; ArcInfo Workstation 10 and ArcGIS 10.1

scalability: synthetic data, building up in number of records; fat client; ArcInfo Workstation 10 and ArcGIS 10.1

network: synthetic data, selection of 1-2 scenarios; fat client; ArcGIS 10.1

workspace: synthetic data, selection of 1-2 scenarios; fat client; ArcGIS 10.1

spatial index: synthetic data, selection of 1-2 scenarios; fat client; ArcGIS 10.1

sort: real-life data, 1-2 scenarios; fat client; ArcGIS 10.1

parallel processing: synthetic data, 1-2 scenarios (DISSOLVE, NEAR); fat client; ArcGIS 10.1

compaction/compression: real-life data; fat client; ArcGIS 10.1

hardware configuration: real-life data; big data computer, ArcGIS 10.2.2

ArcGIS Pro/64 bit: real-life data; big data computer, ArcGIS Pro

Figure 25: Performance factors

5.3 Analysis and Validation of the results

5.3.1 Analysis of execution time and resource usage
The execution time is recorded 5 times per operation to get an impression of the consistency of the results. Which statistical value should be used for the 5 timing results depends on their regularity: if big outliers are recorded, the mean value is not a good option, but rather the median. In many cases, the first run results in a slower execution time because the data have not yet been loaded into memory. Therefore, the difference between the first and subsequent value(s) should be observed closely in the log file.
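Such a measurement loop can be scripted around any geoprocessing tool; a minimal sketch with a hypothetical helper name and the DISSOLVE scenario parameters from Table 8:

import time
import arcpy

arcpy.env.overwriteOutput = True  # allow the 5 repeated runs to rewrite the output

def time_tool(tool, runs=5, **kwargs):
    """Run a geoprocessing tool `runs` times and return all execution times."""
    timings = []
    for _ in range(runs):
        start = time.time()
        tool(**kwargs)
        timings.append(time.time() - start)
    return timings

t = time_tool(arcpy.Dissolve_management, runs=5,
              in_features=r"C:\LocalData\work.gdb\bbg2008",
              out_feature_class=r"C:\LocalData\work.gdb\bbg2008_dis",
              dissolve_field="bg2008a")

median = sorted(t)[len(t) // 2]
print("first run: {0:.1f} s, median: {1:.1f} s".format(t[0], median))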


5.3.2 Validation of the output
As established earlier, not only the performance (execution time, resource usage) is of importance in this benchmark, but also the quality of the results: the same query has to result in the same feature class or table. To check the quality of the results, the 5 output datasets (5 runs per tool) will be checked on the number of records per output dataset. Ideally, the output should also be checked on:

 Content of attributes: does a query on one or more attributes yield the same results?
 Geometry: compare the area of the largest and smallest polygon between the datasets; check the dataset on overlapping features, slivers and overshoots.
 Consistency of statistics of geometrical fields, e.g. area and length.

The number of test runs will be high, so structured checks on all these points would be very time consuming. Therefore, the check on the number of records per output is the priority and will be carried out first.
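The record-count check itself is easily automated; a minimal sketch, assuming hypothetical output names for the 5 runs:

import arcpy

# Hypothetical outputs of the 5 runs of one tool/scenario combination.
outputs = [r"C:\LocalData\work.gdb\bbg2008_dis_run{0}".format(i)
           for i in range(1, 6)]

# All runs should yield exactly the same number of records.
counts = [int(arcpy.GetCount_management(fc).getOutput(0)) for fc in outputs]
if len(set(counts)) != 1:
    print("record counts differ between runs: {0}".format(counts))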

5.3.3 Test data and metrics
It is important to distinguish between data and metrics. The raw data are the result of data collection during the logging process, such as the number of seconds of execution time and the CPU time.

A couple of aspects need to be considered for the evaluation of the logging data:

 Not every tool supports the implementation of a factor, and a factor can be applied in different variations (dataset size, application of the tool to one or two input datasets, and other tool parameters).
 Although the logging results are quite regular, rounding and calculating the median value will result in less precision. Therefore, the resulting values should be studied carefully, by checking the original logging data.
 Foremost, stability is most important, because SN has to deliver reliable statistical products.
 The execution time of the optimization factor relative to the execution time of ArcInfo Workstation is important because production lead times have to stay the same; however, not all scenarios could be tested in ArcInfo Workstation.
 The resource usage has not been included in the score card, because little is known about threshold values for CPU time and page faults for spatial queries. Moreover, resource usage could not be measured for ArcInfo Workstation.

5.4 Setting up the test bed

The benchmark has to be as close as possible to the real-world workload, but also provide insight into the processing mechanisms and scalability of the tools themselves, without the distorting effect of specific datasets and their irregularities. Based on the results of chapters 2, 3 and 4, a micro benchmark with the tools DISSOLVE, UNION, NEAR, INTERSECT and JOIN FIELD will be set up, using real and synthetic data. The geoprocessing tools (DISSOLVE, UNION, NEAR, INTERSECT) will therefore be tested first with synthetic data, in this case grid polygons as well as their label points. An example of the synthetic input datasets is shown in Figure 26. For most of the input data, the data preparation process has been documented in ModelBuilder; the model will be delivered with the digital version of this document as a .tbx file.

The synthetic datasets are grid polygons and their label points, created with the ArcGIS tool CREATE FISHNET. For the DISSOLVE input, fishnet grid datasets of 25 m2 and 100 m2 grid size are combined with a UNION. This dataset contains 150.800.000 records; the DISSOLVE field is the ID of the 100 m2 grid. Out of these UNION results, smaller datasets can be selected, ranging from 10.000 up to 50.000.000 records. 50.000.000 records have been chosen as a maximum due to the long preparation time. Every fishnet grid also creates a dataset of label points. For the NEAR, a selection of the 25 m2 label points, ranging from 10.000 up to 500.000 records, has been used with a 100 m2 fishnet grid line dataset containing 9.412.000 records. For the INTERSECT and the UNION tool, input datasets of 100.000, 500.000, 1.000.000, 5.000.000, 10.000.000 and 50.000.000 records will be used in combination with a grid dataset of constant record size (the original dataset), in order to alter only one parameter instead of two. The synthetic input data will have a default spatial index and no projection.
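A sketch of how such synthetic grids and label points can be created (workspace path and extent coordinates are illustrative; the actual preparation is documented in the delivered ModelBuilder model):

import arcpy

gdb = r"C:\bench\synthetic.gdb"  # hypothetical workspace

# 25 m and 100 m fishnet grids over the same illustrative extent;
# "LABELS" also creates a label point dataset per grid.
arcpy.CreateFishnet_management(gdb + r"\nl25m", "0 0", "0 10",
                               25, 25, 0, 0, "250000 300000",
                               "LABELS", "", "POLYGON")
arcpy.CreateFishnet_management(gdb + r"\nl100m", "0 0", "0 10",
                               100, 100, 0, 0, "250000 300000",
                               "LABELS", "", "POLYGON")

# The DISSOLVE input is the union of both grids; subsets of 10.000 up to
# 50.000.000 records are then selected from the result.
arcpy.Union_analysis([gdb + r"\nl25m", gdb + r"\nl100m"],
                     gdb + r"\nl25m_100m_Union")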

For the baseline tests, the real life datasets are already prepared for use by Spatial Statistics, for example the NWB or the BAG tables. Other datasets, such as the neighbourhood (WB) or the land use (BBG) datasets, can be used instantly. However, further data preparation is needed to create different input datasets per factor: to test compressing and compacting, for example, datasets that are compressed, compacted, or both have to be created. For data partitioning and parallel processing, smaller datasets have to be created with FEATURE CLASS TO FEATURE CLASS.

Figure 26: Example synthetic data - input datasets UNION and INTERSECT

These tools will be tested in isolation, in a similar fashion as the micro workload chosen in Jackpine (Ray et al., 2011). Combined with the datasets, the workload scenarios shown in Tables 8 and 9 will be tested (Table 9 only for resource related factors):


Tool: DISSOLVE
 Dataset real: BBG 2008 (233.153 rec.)
 Dataset synthetic6: 100.000, 500.000, 600.000, 700.000, 800.000, 900.000, 1.000.000, 5.000.000, 10.000.000 and 50.000.000 rec.
 Scenario (real): dissolve BBG 2008 (bbg2008) boundaries based on attribute value of field "bg2008a"
 Scenario (synthetic): dissolve nl25m_100m_Union selection on field "FID_nl100mgrid"

Tool: UNION
 Dataset real: District/Neighbourhood boundaries 2012 (12.882 rec.) and District/Neighbourhood boundaries 2008 (12.570 rec.)
 Dataset synthetic: input 1: fishnet grid of 150.000.000 rec.; input 2: fishnet grids (different bounding box) of 100.000, 500.000, 1.000.000, 5.000.000, 10.000.000 and 50.000.000 rec.
 Scenario (real): create a union of District/Neighbourhood boundaries 2012 with District/Neighbourhood boundaries 2008
 Scenario (synthetic): create a union of 2 sets of fishnet grid (same number of records, different bounding box)

Tool: INTERSECT
 Dataset real: NWB (2.826.593 rec.) and forest land use (25.157 rec.)
 Dataset synthetic: input 1: fishnet grid of 150.000.000 rec.; input 2: fishnet grids (different bounding box) of 100.000, 500.000, 1.000.000, 5.000.000, 10.000.000 and 50.000.000 rec.
 Scenario (real): calculate intersections of NWB lines and BBG polygons classified as "forest"
 Scenario (synthetic): calculate intersections of 2 sets of fishnet grid (same number of records, different bounding box)

Tool: NEAR
 Dataset real: NWB (2.826.593 rec.) and intersections of forest areas and NWB (126.331 rec.)
 Dataset synthetic: input 1: fishnet grid converted to polylines, 9.412.000 rec.; input 2: label points of fishnet grids of 10.000, 25.000, 50.000 and 500.000 rec.
 Scenario (synthetic): calculate within 5000 m the nearest projection on the 100 m2 fishnet grid line dataset from a label point dataset, varying in number of records
 Scenario (real): calculate projection points on NWB within 5000 m from the intersection points of Top10NL with the land use file (polygons classified as "forest"; the value for forest in BBG is "60"; the file name of these intersection points is "intersect_p")

Table 8: Workload scenario geoprocessing

6 Fishnet grid (created by a UNION of the 25 m2 and 100 m2 grid size fishnet grids).


Tool: JOIN FIELD (optional)
 Datasets: house numbers (8.668.742 rec.); combined dataset of residential objects, pitches and anchorages (8.443.296 rec.)
 Scenario: JOIN FIELD based on ID

Tool: FREQUENCY (optional)
 Dataset: house numbers (8.668.742 rec.)
 Scenario: calculate the frequency of field "straat" and store the results in an output table

Tool: SUMMARY STATISTICS (optional)
 Dataset: house numbers without values "NULL" (8.443.273 rec.)
 Scenario: calculate the means of fields "POINT_X" and "POINT_Y" per PC67 to determine the PC6 centroid and store the results in an output table

Table 9: Workload scenario administrative processing

As part of the testing environment, the following software is needed:

 ArcGIS 10.1 (and 10.2)
 ArcGIS Pro beta (pre-release version)
 ArcInfo Workstation (including technical assistance from SN)
 AML (including technical assistance from SN)
 Python 2.7 and 3.1

Python is used to automate and document the processing steps, but also to record performance metrics such as execution time and CPU use. The required Python libraries are given in Table 10; a condensed sketch of how they fit together follows the table:

Library: Arcpy. Goal: ArcGIS 10.x geoprocessing
Library: Datetime. Goal: records date and time
Library: Psutil. Goal: records usage of resources
Library: Sys. Goal: system related commands
Library: Traceback. Goal: error handling
Table 10: Required Python libraries
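A condensed sketch of the measurement part of the script (paths and the scenario ID are illustrative; the actual script writes the results to a log database instead of printing them):

import datetime
import sys
import traceback

import arcpy
import psutil

def run_scenario(tool, scenario_id, run_nr, *args):
    # Time one geoprocessing run and record the CPU usage of this process.
    proc = psutil.Process()
    cpu_before = proc.cpu_times()
    start = datetime.datetime.now()
    try:
        tool(*args)  # e.g. arcpy.Dissolve_management
    except Exception:
        traceback.print_exc()
        sys.exit(1)
    elapsed = (datetime.datetime.now() - start).total_seconds()
    cpu_after = proc.cpu_times()
    print("result%s_loopnr_%d" % (scenario_id, run_nr))
    print("execution time: %.2f s" % elapsed)
    print("CPU user: %.2f s" % (cpu_after.user - cpu_before.user))
    print("CPU system: %.2f s" % (cpu_after.system - cpu_before.system))

run_scenario(arcpy.Dissolve_management, "2_0_3_2_3_0_1_0_0_1_0_1_1", 1,
             r"C:\bench\data.gdb\bbg2008", r"C:\bench\out.gdb\dissolve_bbg",
             "bg2008a")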

The script developed for the benchmark writes the results of each scenario to a log database. The scenarios are marked with a unique ID, built from the different factors as components:

A_B_C_D_E_F_G_H_I_J_K_L_M_N_O_P. For each letter (variable), a domain of numbers has been assigned, shown in Appendix 3: Nomenclature logfiles. For example, variable A has a domain of numbers assigned to tools (e.g. 2 for a DISSOLVE, 4 for an INTERSECT). After a number of test runs, the warm/cold run variable did not prove to be important for the benchmark measurements. Whereas each ID stands for a specific scenario, an additional number per run is appended to the ID to distinguish the results per run.
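A sketch of how such an ID can be composed (the order of the variables follows the nomenclature above; the concrete domain values are illustrative):

# One number per variable A..P, e.g. A = 2 (DISSOLVE), J = 1 (ArcGIS 10.1)
factor_values = {"A": 2, "B": 0, "C": 3, "D": 2, "E": 3, "F": 0,
                 "G": 1, "H": 0, "I": 0, "J": 1, "K": 0, "L": 0,
                 "M": 1, "N": 0, "O": 1, "P": 1}

scenario_id = "_".join(str(factor_values[v]) for v in "ABCDEFGHIJKLMNOP")
run_nr = 1
log_name = "result%s_loopnr_%d" % (scenario_id, run_nr)
print(log_name)  # e.g. result2_0_3_2_3_0_1_0_0_1_0_0_1_0_1_1_loopnr_1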

7 Postal Code areas (on 6 digit level, in Dutch "Postcode 6")

A: Toolname
B: Hardware configuration
C: Population length
D: Population width
E: Index
F: Spatial sort
G: File type
H: Compressing/compacting
I: Daytime
J: ArcGIS version
K: Processing
L: Location script
M: Cold/warm run
N: Projection
O: Real/synthetic data
P: Workspace
Table 11: Nomenclature logfiles

5.5 Defining the performance level

During interviews with the GIS staff, the most important demand for a migration to e.g. ArcGIS Desktop 10.1 is that the execution time of the tools should at least not be longer than currently in ArcInfo Workstation, apart from other non-functional requirements such as stability and correctness of the output data. Of the non-functional requirements, performance (in execution time) had the highest priority, although scalability should be regarded as equally important. Table 12 shows an indication of the performance needs, expressed in execution time (seconds). In this table, the dataset and its size, expressed in memory and number of records, have been added because they are essential factors for the level of performance:

Tool: DISSOLVE
 Datasets: BBG2008 (land use map)
 Memory: 601 MB
 Number of records: 233.153
 Win7/WS10 (sec): 32

Tool: INTERSECT
 Datasets: nwb_split (enriched road map), bos_bbg08 (land use file/wood)
 Memory: 3,49 GB = 3573,76 MB
 Number of records: 2.826.593 (NWB), 25.157 (BBG_bos)
 Win7/WS10 (sec): 518

Tool: NEAR
 Datasets: NWB_split, intersect_p
 Memory: 3,49 GB = 3573,76 MB
 Number of records: 2.826.593 (NWB), 126.331 (intersect_p)
 Win7/WS10 (sec): 14.707

Tool: UNION
 Datasets: NEDWB_08, NEDWB12
 Memory: 18,7 MB; 27,5 MB
 Number of records: 12.570 (WB2008), Xxx
 Win7/WS10 (sec): 47

Table 12: Indication performance requirements geoprocessing tools


The heavy users within the spatial team have indicated the current execution time of the tools in ArcInfo Workstation as a very clear requirement for a more up-to-date GIS configuration. Therefore, the table resulting from the initial baseline in Workstation is shown again here to represent the minimum performance requirements. The following table shows the performance requirements for a number of administrative processing tools.

Tool: admin. join indexed (JOIN ITEM/JOIN FIELD)
 Datasets: house numbers (num2012ba), vsl2012ba.pull
 Memory: num2012ba: 552 MB; vsl2012ba.pull: 587 MB
 No. records: 8.668.742 (num2012ba), 8.443.296 (vsl2012ba.pull)
 Win7/WS10 (sec): 320

Tool: FREQUENCY
 Dataset: house numbers (num2012ba)
 Memory: 811 MB
 No. records: 8.668.742
 Win7/WS10 (sec): 385

Tool: SUMMARY STATISTICS
 Dataset: house numbers without values "NULL"
 Memory: 811 MB
 No. records: 8.443.273
 Win7/WS10 (sec): 102

Table 13: Indication performance requirements administrative processing

At first, the tools will be tested with the "default" settings, meaning that there will be test runs per tool with the configuration that the staff works with: for example, with the file geodatabase as data format and with the (default) spatial index. With a file geodatabase, a spatial index is always created; testing without a spatial index is therefore not realistic, although the regularly used ArcInfo Workstation does not use spatial indexes by default, only after the AML command "build index". Other factors are not used within the daily processing activities. It is difficult to predict the development of the execution time from the ArcInfo Workstation baseline towards ArcGIS Desktop 10.1. Some tools, such as the DISSOLVE, are much faster in Workstation than in ArcGIS 10.1. On the other hand, the coverages of ArcInfo Workstation are limited in a number of properties such as file size. Moreover, the UNION, for example, did not run stable in Workstation on the BBG and had to be scaled down to a smaller file. The execution time per baseline scenario will be recorded, as well as the CPU time. The scenarios for the real life workload are described in Table 8 and Table 9. For the scalability scenarios, the synthetic data (the creation of which is described in section 5.4) will be used. The four tools DISSOLVE, UNION, INTERSECT and NEAR will be executed with growing input datasets (only one input dataset will grow in number of records). Except for the NEAR, a fairly high number of records has been used. The synthetic workload scenarios are described in Table 8. Most of the tests will run in ArcGIS 10.1 Desktop; only the DISSOLVE will run in ArcInfo Workstation as well, for grid datasets of 100.000, 500.000 and 1.000.000 records, because the creation of the other input datasets caused errors. For these errors, no bug reports have been issued because the limits of creating polygons in ArcInfo Workstation are known.

5.6 Scenarios to exclude noise from the network and workspace

5.6.1 Location of the script

In chapter 2 we established that both scripting languages, AML and Python, are interpreted languages that have to be compiled during execution. As scripts are mostly accessed via the network, performance is expected to improve when the complete script is stored locally. The script that executes the benchmark will be used for this purpose and will execute the DISSOLVE tool on the synthetic dataset of 50.000.000 records. It is not known how the script will be executed: will the complete script be read at once, loaded into local memory and executed locally, or will the script be retrieved and executed line by line? In the latter case, a significant performance improvement should be measured.


5.6.2 Day/Time Execution

In chapter 1 it has been established that several elements of the geoprocessing process have to use network capacity, such as reading, compiling and executing the script, accessing the user profile and the license server. This factor should be applied to one or two tool-data combinations with a long execution time at the following day/time combinations:

a. Monday/Friday during office hours, e.g. 11 am
b. Monday/Friday after 7 pm, or in the weekend

Tool: NEAR
 Datasets: NWB and park entrances (intersect_p), both including spatial index
 Dataset size in bytes: 498.427.106 (NWB), 14.953.331 (intersect_p)
 Nr of records: 2.826.593 (NWB), 126.331 (intersect_p)
Table 14: Scenario network

5.6.3 Input and output in same workspace

Like the previous scenarios, one or two tools with large input datasets are sufficient. The DISSOLVE (50.000.000 records) and the NEAR (100.000 label points) will be used for this purpose. The impact is expected to be very low.

5.7 ArcGIS 10.1 optimization scenarios

5.7.1 Implementation Sorting

Sorting the data along a space-filling curve such as the PEANO curve can improve drawing performance in ArcMap as well as routing and directions, compared to unsorted data. In this benchmark, the PEANO curve sorting algorithm will be applied as a factor to improve the performance of the NEAR and the DISSOLVE tool. With the NEAR, the sorting will be applied in two variations: sorting of the point input dataset and sorting of both input datasets. The DISSOLVE will also use two variations, but in a different way: for the first variation only the spatial sorting will be applied, for the second spatial and administrative sorting will be combined (a minimal sketch of both variations follows Table 15).

a. Default (no sorting)
b. PEANO sorting

Tool: NEAR
 NWB and intersection points (without sorting)
 NWB and intersection points (with PEANO sorting, ascending)

Tool: DISSOLVE
 BBG2008 (without sorting)
 BBG2008 (with PEANO sorting, ascending)
 BBG2008 (with PEANO sorting and sorting of field "bg2008a", ascending)

Table 15: Sorting workload scenario
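A minimal sketch of both DISSOLVE sorting variations with the Sort tool (paths hypothetical; PEANO is one of the spatial sort methods offered by the tool):

import arcpy

src = r"C:\bench\data.gdb\bbg2008"  # hypothetical input

# Variation 1: spatial sorting only, along the Peano space-filling curve
arcpy.Sort_management(src, src + "_peano",
                      [["Shape", "ASCENDING"]], "PEANO")

# Variation 2: spatial sorting combined with administrative sorting
# on the dissolve field "bg2008a"
arcpy.Sort_management(src, src + "_peano_field",
                      [["Shape", "ASCENDING"], ["bg2008a", "ASCENDING"]],
                      "PEANO")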


5.7.2 Implementation Spatial Indexing

As explained in section 4.6.1 (Spatial indexing), ArcGIS Desktop does not provide the possibility to view and adapt the grid levels of the spatial index; it is only possible to delete and recreate the index. Therefore, this approach is intuitive and black box: does the spatial index matter in this scenario? The effect of the spatial index (default, modified and without) will be tested. The modified version will depend on the spatial distribution and size of the features. The quadtree subdivision that is applied for overlay tools such as UNION and INTERSECT is automatically applied for large datasets and cannot be "turned off". Table 16 shows the input datasets; a sketch of recreating the index with explicit grid sizes follows the table.

Tool: NEAR (all variations)

Factor variation: default spatial index
 Datasets: label points of the 25 m grid (10.000, 25.000 and 50.000 rec.) as input features; 100 m grid polyline (9.412.000 rec.) as NEAR features

Factor variation: adapted spatial index 1000_0_0
 Datasets: label points of the 25 m grid (10.000, 25.000 and 50.000 rec.) as input features; 100 m grid polyline (9.412.000 rec.) as NEAR features

Factor variation: adapted spatial index 25000_0_0
 Datasets: label points of the 25 m grid (10.000, 25.000 and 50.000 rec.) as input features; 100 m grid polyline (9.412.000 rec.) as NEAR features

Factor variation: no spatial index
 Datasets: label points of the 25 m grid (10.000, 25.000 and 50.000 rec.) as input features; 100 m grid polyline (9.412.000 rec.) as NEAR features

Table 16: Scenario spatial index
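A sketch of deleting and recreating the spatial index with an explicit grid size, which is the only adaptation ArcGIS Desktop allows (feature class path hypothetical):

import arcpy

fc = r"C:\bench\synthetic.gdb\labels_25m"  # hypothetical feature class

# Remove the default index, then recreate it; "1000_0_0" means one grid
# level of size 1000 (a value of 0 leaves a grid level unused).
arcpy.RemoveSpatialIndex_management(fc)
arcpy.AddSpatialIndex_management(fc, 1000, 0, 0)

# The "no spatial index" variation simply stops after the removal:
# arcpy.RemoveSpatialIndex_management(fc)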


5.7.3 Parallel Processing

For the evaluation of this factor, only DISSOLVE and NEAR have been chosen, because the environment setting is not suitable for overlay tools such as INTERSECT and UNION. The scenarios are listed in Table 17; a sketch of setting the factor follows the table.

Tool: DISSOLVE (synthetic dataset of 50.000.000 records)
 Linear
 Parallel, default core usage (the tool decides how processes are distributed)
 Parallel, 50% core usage
 Parallel, 75% core usage
 Parallel, 100% core usage

Tool: NEAR (synthetic datasets: point dataset of 100.000 records, grid polyline dataset of 9.412.000 records)
 Linear (default)
 Parallel, default core usage (the tool decides how processes are distributed)
 Parallel, 50% core usage
 Parallel, 75% core usage
 Parallel, 100% core usage

Table 17: Scenario parallel processing
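A sketch of adapting this setting (dataset paths hypothetical):

import arcpy

# The factor is an environment setting read by tools that support it;
# an empty string restores the default behaviour (the tool decides).
for label, factor in [("default", ""), ("50", "50%"),
                      ("75", "75%"), ("100", "100%")]:
    arcpy.env.parallelProcessingFactor = factor
    arcpy.Dissolve_management(r"C:\bench\synthetic.gdb\union_50mln",
                              r"C:\bench\out.gdb\dissolve_pp_%s" % label,
                              "FID_nl100mgrid")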

5.7.4 Compacting/Compressing implementation

These possibilities will be tested separately and combined. Lossy compression will not be used because Spatial Statistics has to maintain data precision.

a. Compressed (lossless)
b. Compacted
c. Compressed (lossless) and compacted


Compressing will most probably not have an impact on performance: it reduces the file size, but this could only influence the I/O process, not the calculation process. Moreover, it is not known how much time it takes to remove the read-only status of the compressed file. If an assumption has to be made, the execution time with compression will remain at the same level as with the default settings. Compacting is not expected to play a very important role in improving performance either, because the data that are used most are not frequently updated, with the exception of the BBG, which is the result of inserting and editing land use polygons. Compacting also does not always lead to better performance, although it does save storage. The NEAR will be tested with a compacted file only, as the compression of the road map (NWB) resulted in an error. The datasets are listed in Table 18; a sketch of applying both operations follows the table.

Tool: DISSOLVE
 BBG2008 default (not compressed, not compacted)
 BBG2008 compressed loss-less
 BBG2008 compacted
 BBG2008 compressed loss-less and compacted

Tool: UNION
 District/Neighbourhood boundaries 2012 and 2008, default (not compressed, not compacted)
 District/Neighbourhood boundaries 2012 and 2008, compressed loss-less
 District/Neighbourhood boundaries 2012 and 2008, compacted
 District/Neighbourhood boundaries 2012 and 2008, compacted and compressed loss-less

Tool: NEAR
 NWB and intersections (intersect_p), default (not compressed, not compacted)
 NWB and intersections (intersect_p), compacted

Table 18: Datasets compacting and compressing
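A sketch of preparing the compressed and compacted variants (paths hypothetical; compressed data become read-only and have to be uncompressed before editing):

import arcpy

gdb = r"C:\bench\data.gdb"   # hypothetical file geodatabase
fc = gdb + r"\bbg2008"

# Variant b: compact the whole workspace (reclaims space after edits)
arcpy.Compact_management(gdb)

# Variants a and c: loss-less compression of the feature class
# (keyword as documented for the Compress File Geodatabase Data tool)
arcpy.CompressFileGeodatabaseData_management(fc, "Lossless compression")

# Undo the compression when the read-only state is no longer wanted
arcpy.UncompressFileGeodatabaseData_management(fc)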

5.8 Resource upgrade scenarios

5.8.1 Implementation Hardware and Operating System

The hardware configuration is expected to show some improvement in performance, although not a substantial one according to Godfrind (2008), when compared to application design or database structure. However, compared to the fat client configuration, especially the SSD technology is expected to speed up the analysis process. The factor is hard to determine; data access can be up to 100 times faster. The RAM memory is about 4 times larger, but without 64 bit processing ArcGIS will probably not be able to use such an amount of memory, although the benchmark datasets will fit in the RAM. The clock speed of the CPU is lower, although there are many more cores and more memory assigned to the CPU. Table 19 shows the main properties of the fat clients of the team Spatial Statistics and of the big data computer in the innovation laboratory. The real data scenarios will be executed on the big data computer in ArcGIS 10.2, which is not directly comparable with ArcGIS 10.1.


Property: Fat client / Big data computer

Operating system: Windows 7 / Windows 7 Professional
GIS software: ArcGIS 10.1 SP1 / ArcGIS 10.2
Hardware provider: Fujitsu / Fujitsu Celsius M720
Disk: HDD / SSD (4 x 256 GB)
Windows performance index: 5.2 / 5.4
System type: 64 bit / 64 bit
RAM: 8 GB (7,85 GB available) / 32 GB
Processor type: Intel® Core™ i5-3570 CPU @ 3.40 GHz (4 cores) / Xeon E5-2640 2.5 GHz, 15 MB, Turbo Boost, 16 cores
C drive: part of the C drive (user and temp folder) reconnects to the datacentre / C- and E-drive (both local hard disk drives)

Table 19: Properties hardware

5.8.2 Implementation ArcGIS Pro

The performance in ArcInfo Workstation is the performance of the current system; the system under test is ArcGIS Desktop or, eventually, ArcGIS Pro. Little is yet known about how geoprocessing and administrative tools are handled in ArcGIS Pro. The performance of the prioritized tools will be measured with the default settings and compared to the ArcGIS Desktop default settings. As it is only possible to test ArcGIS Pro in the innovation laboratory on the big data computer, ArcGIS Pro should be tested on a fat client later on.



Chapter 6: Results and Conclusions of the Benchmark

6.1 Results Baseline tests

6.1.1 Baseline ArcInfo Workstation

To establish a baseline of the current performance of geoprocessing tools in ArcInfo Workstation, and to study the effect of the upgrade of the operating system to Windows 7, measurements of geoprocessing tools in ArcInfo Workstation 9 and 10 have been taken. Workstation 9 has been tested on a relatively "old" fat client machine with Windows XP and "older" hardware properties to sketch the situation before the upgrade (see section 3.3.2 Fat clients properties). The measurements of Workstation 10 on Windows 7 show the current situation and depict the baseline for performance. Effectively, the old and new fat clients represent at least 3 different performance factors: hardware resources, operating system (OS) and GIS software version. ArcInfo Workstation 10 has been tested on a Windows XP configuration as well, to exclude the factor of a higher ArcInfo Workstation version while the combination of hardware and OS remains the same. The measurements on the Windows XP fat clients show that all geoprocessing tools run slower in the Windows XP and Workstation 9 environment, except for an administrative "cursor" tool (in the test situation, an update cursor). The combination of Windows XP and ArcInfo Workstation 10 shows the slowest performance of all, except for the cursors (short and long version): slower than the combination of Windows XP and ArcInfo Workstation 9. The reason behind the slow performance of the cursor on the new fat clients could be a different algorithm execution in Workstation 10, but then the tool should also have performed slower on the old fat client with Workstation 10. Therefore, the reason could also be the operating system, Windows 7.

The graph shows the execution time of the tools in seconds. It is important to know that the presented tools have been using different input datasets, representative for their use at SN. The UNION crashed in the Windows XP and Windows 7 configurations with the land use file (BBG); therefore this tool has been run with a smaller dataset, the neighbourhood and district boundaries (wijk- en buurtgrenzen). Although the input datasets differ per tool, the difference between the three configurations shows a clear trend, with only one exception (the cursor). Figure 27 also shows that the performance depends on the combination of software version and operating system version, since most combinations of Windows XP and ArcInfo Workstation 10 show a slower execution time than Windows XP and ArcInfo Workstation 9.

During the baseline test in ArcInfo Workstation, the DISSOLVE (on a dataset of 233.153 records) showed the biggest difference in execution time between the different configurations, followed by the UNION (on a small dataset). This result could lead to the assumption that not the higher Workstation version but rather the hardware specifications cause the improved performance, influencing the DISSOLVE tool more than the other tools, at least with this dataset and these DISSOLVE settings.

The values of the execution time are presented in seconds on a logarithmic scale because of the difference between the execution time of the NEAR and that of the remaining tools.


Figure 27: Result (execution time) ArcInfo Workstation baseline test on a selection of geoprocessing and administrative processing tools

6.1.2 Baseline ArcGIS 10.1

Execution and CPU (elapsed) time with default settings in ArcGIS 10.1 (in seconds):

                 UNION   INTERSECT   DISSOLVE   NEAR
execution time   35,2    320,4       610,83     14455
CPU total        23,84   320,68      610,75     14455,43
CPU user         20,94   273,23      581,56     14372,26
CPU system       2,90    47,46       29,20      83,17

Figure 28: UNION, INTERSECT, DISSOLVE and NEAR execution and CPU time with ArcGIS 10.1 on fat clients Windows 7, default settings


Table 20 shows an overview of the execution times of the same scenarios in Workstation versus Desktop (10.1), both in default state. It shows that, using these scenarios, only the DISSOLVE is faster in ArcInfo Workstation (about 20 times); the other scenarios result in slightly faster execution times in ArcGIS 10.1.

Tool: DISSOLVE
 Datasets: BBG2008
 No. records: 233.153
 Win7/WS10: 32 sec; ArcGIS 10.1: 611 sec

Tool: INTERSECT
 Datasets: nwb_split (enriched road map), bos_bbg08 (land use file/wood)
 No. records: 2.826.593 (NWB), 25.157 (BBG_bos)
 Win7/WS10: 518 sec; ArcGIS 10.1: 321 sec

Tool: NEAR
 Datasets: NWB_split, intersect_p
 No. records: 2.826.593 (NWB), 126.331 (intersect_p)
 Win7/WS10: 14.707 sec; ArcGIS 10.1: 14.455 sec

Tool: UNION
 Datasets: NEDWB_08, NEDWB12
 No. records: 12.570 (WB2008), Xxx
 Win7/WS10: 47 sec; ArcGIS 10.1: 35,2 sec

Table 20: Execution time ArcInfo Workstation versus ArcGIS 10.1 default real workload

Table 21 shows a number of scenarios for administrative processing. These will not be optimized further, but the difference between ArcInfo Workstation and ArcGIS 10.1 Desktop (in default settings) provides a more complete overview. The administrative tools are all tested with a large dataset of approximately 8.500.000 records. Except for the JOIN, the tools (SUMMARY STATISTICS and FREQUENCY) show an improvement in execution time in ArcGIS 10.1, without additional optimization factors.

Tool: ADMIN. JOIN INDEXED
 Datasets: house numbers (num2012ba), vsl2012ba.pull
 No. records: 8.668.742 (num2012ba), 8.443.296 (vsl2012ba.pull)
 Win7/WS10: 320 sec; ArcGIS 10.1: 8229 sec

Tool: SUMMARY STATISTICS
 Dataset: house numbers without values "NULL"
 No. records: 8.443.273
 Win7/WS10: 102 sec; ArcGIS 10.1: 52 sec

Tool: FREQUENCY
 Dataset: house numbers
 No. records: 8.668.742
 Win7/WS10: 385 sec; ArcGIS 10.1: 270 sec

Table 21: Administrative processing ArcInfo Workstation vs ArcGIS Desktop 10.1 default settings

6.2 Results Scalability and data size

6.2.1 Results synthetic data INTERSECT and UNION

As explained in section 5.4, a very large input dataset of 150.100.000 fishnet grid polygons will be used for the INTERSECT and the UNION in an overlay with a second dataset that increases in size (number of records): 100.000, 500.000, 1.000.000, 5.000.000, 10.000.000 and 50.000.000. The last two runs, of 10.000.000 and 50.000.000 records, have not been tested for the UNION because of the long execution time in combination with limited time resources. The execution time results of the INTERSECT do not show a linear trend: the run of 1.000.000 versus 150.100.000 records shows a lower execution time than the scenario of 100.000 versus 150.100.000 records, whereas the run of 50.000.000 versus 150.100.000 records shows a higher value than a linear trend would predict. These scenarios have been repeated once more and resulted in the same values.


Figure 29 shows a big difference between the execution times of the UNION and the INTERSECT, although these tools in fact follow the same steps during the overlay process, except for writing the new output: the INTERSECT output contains only the overlapping geometry, while the UNION output contains the overlapping plus the remaining geometry, so more has to be written to disk. With such a large first input dataset (150.000.100 records) combined with a second input dataset of, for example, 100.000 records, the UNION needs I/O time to read 150.000.100 features and to write approx. 150.200.000 features, whereas the INTERSECT needs I/O time to read the same number of input features but writes only approx. 200.000 features. Moreover, the second input dataset is very small compared to the first input dataset, and its variation in size is small. Therefore, the execution time hardly changes and the system time percentage stays at approx. 40% (see Figure 30).

Figure 29: INTERSECT synthetic data using a constant input dataset of 150.000.100 records and a second input dataset with sizes from 100.000 to 50.000.000 records

Figure 30: CPU system and user time ratio INTERSECT and UNION - ArcGIS 10.1 Windows 7 fat client


6.2.2 Results synthetic data DISSOLVE

The results shown in Figure 31 and Figure 32 show a linear development of the execution time related to the number of records, up to the maximum of 50.000.000 records. A higher number has not been included in the test because the creation of the feature class already takes a long time. The comparison with ArcInfo Workstation is only partially possible: the DISSOLVE has been tested with coverages of 100.000, 500.000 and 1.000.000 records because of the size limitations of the coverage. For these sizes, the DISSOLVE has a shorter execution time than with the file geodatabase (Figure 33). The development of the results is linear in both cases.

Figure 31: Results feature class up to 10.000.000 records

The results of the DISSOLVE have been visualized in two figures, to be able to show the linear development with datasets up to 10.000.000 records as well as the development with datasets up to 50.000.000 records.

Figure 32: Results feature class up to 50.000.000 records

Considering Figure 33, it is also interesting to note that the coverages show better performance for 100.000, 500.000 and 1.000.000 records, but the acceleration of the coverage is "only" between 4,4 and 3,5, compared to 20 for the real life data (see Table 20). The stored topology with the principle of contiguity provides more of an advantage for the coverage with real data than with synthetic data, since the real data contain more vertices. Still, the creation of the input coverages shows the limits of the scalability of ArcInfo Workstation.

Figure 33: Comparison DISSOLVE synthetic data in ArcInfo Workstation and ArcGIS Desktop 10.1 for 100.000, 500.000 and 1.000.000 records

6.2.3 Results synthetic data NEAR

After plotting the execution time results, the NEAR shows a linear curve (Figure 34). Due to the long execution time of the 500.000 records run, further upscaling has not been attempted.


Figure 34: NEAR synthetic - linear curve

The following chart clearly shows that most of the CPU time is used by the actual calculation: the system time only shows a slight rise compared to the linear rise of the CPU user time.


Figure 35: NEAR synthetic data - execution and CPU time

6.3 Exclusion of noise due to network interference or location of input and output data in workspace

The results of the NEAR show practically no difference in execution time and in the distribution of CPU user and system time; no effect of network interference has been measured for this scenario. The same can be said for the DISSOLVE: similar to the NEAR, the execution and CPU time outside office hours is slightly longer than during office hours. More or less network traffic has no impact. One of the reasons could be that only one tool is run at a time, instead of a sequence of tools.


Figure 36: Execution and CPU time NEAR outside and during office hours, and with a local script outside office hours (ArcGIS 10.1, fat client)

For the location of input and output data, no significant difference could be measured, as expected. Storage of the input data and output data in the same file geodatabase led to similar values as storage in different file geodatabases:

Figure 37: Impact of same or different workspace (fgdb) for in- and output data

The results show almost no difference in performance for the DISSOLVE and the NEAR: the difference either way between keeping the input and output data in the same or in different workspaces is very small relative to the execution time, and therefore not useful to implement as an optimization factor. No significant difference could be detected for network interference in terms of script location, or for running a tool during a period expected to have less network activity. This is not a surprising result, given that the data are stored locally, as are the I/O process and the calculation. There is no difference in the ratio of CPU user time to system time, compared to the default settings. The same is true for the workspace factor: the acceleration is only 0,02 for the NEAR, and the DISSOLVE runs slightly slower (0,05). The NEAR had been assumed to show slightly better results, because the NEAR distances are added to one of the input datasets. These results lead to the conclusion that "noise" via the network and workspace can be excluded.

6.4 Sorting and indexing

The goal of this benchmark component is to test the impact of the means available in ArcGIS Desktop 10.1 to organize the data in an efficient way. As explained in chapter 2, the possibilities to apply spatial access methods in ArcGIS Desktop are limited.

6.4.1 Sorting (Peano)

The application of sorting shows no improvement in performance for the DISSOLVE tool. The ratio between CPU user and system time remains the same. The combination of Peano sorting and field (row prime) sorting shows a minimal difference with the default solution.

Impact of sorting (Peano) on the DISSOLVE, Win 7, ArcGIS 10.1 (in seconds):

                 DISSOLVE sorted PEANO   DISSOLVE sorted PEANO and field   DISSOLVE default
execution time   611,50                  610,50                            610,83
CPU total        609,15                  610,25                            610,75
CPU user         581,77                  582,70                            581,56
CPU system       27,39                   27,55                             29,20

Figure 38: Impact of sorting (Peano with and without combination of field sorting) on DISSOLVE (real data)

The same can be said for the NEAR: sorting both inputs seems to provide the best result, although the improvement is very small. Looking at the original values underlying the medians: the median of the sorting operation involving only the point input has been derived from the values 14198 and 14053 seconds, whereas the sorting of both input datasets resulted in values of 14109 and 14096 seconds.


Impact of sorting (PEANO) on the NEAR, Win 7, ArcGIS 10.1 (in seconds):

                 NEAR sorted PEANO (point input)   NEAR sorted PEANO (point and line inputs)   NEAR default (no sorting)
execution time   14125,5                           14102,5                                      14455
CPU total        14126,69                          14102,96                                     14455,43
CPU user         14043,98                          13945,48                                     14372,26
CPU system       82,70                             157,48                                       83,17

Figure 39: Impact of sorting (Peano) on NEAR with options of sorting one or both input datasets

The space-filling Peano curve subdivides the space into sub-tiles. When both input datasets are sorted, they are probably subdivided into the same sub-tiles, so the search for the nearest neighbour starts within the "common" tile.

6.4.2 Spatial Index

Figure 40 shows that no significant impact can be detected for the NEAR calculation from 10.000, 25.000 or 50.000 input points to the polyline dataset. Even the option of "no spatial index" does not make a difference. This holds for the execution time as well as for the CPU user time; the CPU system timeline only shows a very slight slope. The difference between the numbers of input features seems to be too small for higher I/O activity, but the computation of NEAR features takes more time. This could mean that more calculation capacity from the CPU is needed simply because of the higher number of records, whereas the application of the spatial index does not reduce the needed calculation capacity. The spatial index does not lead to higher I/O activity either, as the difference between the CPU system time values of the runs with and without spatial index is very small.


[Chart: impact of the spatial index on the NEAR with synthetic data (Win 7 fat client, ArcGIS 10.1). For 10.000, 25.000 and 50.000 input features, each tested with the default, the two adapted and no spatial index, the execution times stay within 812,0-830,8 s, 1682-1715 s and 3171-3241 s per size group respectively; CPU system time rises only slightly (approx. 89-93 s to 102-106 s).]

Figure 40: Impact of spatial index on NEAR in ArcGIS 10.1

6.4.3 Parallel Processing

The parallel processing environment setting is a parameter that can be adapted per tool; leaving the parameter empty lets the tool assign the CPU tasks. Setting the factor to 50%, 75% or 100% leads to a decline in performance for the DISSOLVE and almost no improvement in performance for the NEAR. Moreover, the distribution of the CPU percentage does not match the factor (Table 22): for 100%, for example, a more even distribution between the cores would be expected, e.g. 4 x 25%. A sketch of how such per-core figures can be sampled follows Table 22.


Parallel processing environment factor for NEAR and DISSOLVE, Win 7 fat client, ArcGIS 10.1 (execution time in seconds):

           default   parallel 50%   parallel 75%   parallel 100%
NEAR       6629      6658           6668           6670
DISSOLVE   5835      5820,5         5859,5         5425,6

Figure 41: Impact of adapting parallel processing environment setting with DISSOLVE and NEAR

Run   CPU percentage   Core1   Core2   Core3   Core4
1     50%              3,9     5       45,9    32,2
2     50%              8,1     4,7     45,7    40,2
1     75%              8,2     0       47,2    37,8
2     75%              9,3     6,3     38,8    37,9
1     100%             5,7     5,5     46,2    34,9
2     100%             7,3     4,3     37,3    49,9
Table 22: CPU percentages after adaptation of the parallel processing environment setting
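A sketch of how such per-core percentages can be sampled with psutil while a tool is running (the sampling window is illustrative):

import psutil

# Sample the load of every core once per second for 60 seconds while
# the geoprocessing tool runs (e.g. started from a second process).
samples = []
for _ in range(60):
    samples.append(psutil.cpu_percent(interval=1.0, percpu=True))

# Average load per core over the sampling window
for i, core in enumerate(zip(*samples), start=1):
    print("Core%d: %.1f%%" % (i, sum(core) / len(core)))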

Setting the parallel processing environment does not seem to work well, neither in terms of performance nor in terms of the actual distribution of computation power among the cores. The implementation of the parallel processing environment does not provide the desired result for the DISSOLVE. However, it would still be worthwhile to spend more time experimenting with parallel processing, with other tools but also with parallel processing via Python scripting.

6.5 Compression and compaction

Although the expectations of these optimization methods have not been high, the results are still surprising: with the compressed dataset, the DISSOLVE runs 3,75 times slower than with the default setting, also in combination with compaction. The results also show a higher CPU time: the total CPU time as well as the CPU user time is higher than the execution time, meaning that more CPU capacity is needed than available for the compressed data. It also shows that not the I/O process but the computation is the bottleneck for the operation. Examination of the CPU percentages shows a total CPU use over all cores of 170%, compared to approx. 100% with the default or compacted data (the CPU time represents the sum over all cores). A plausible reason is the need to decompress the data prior to the actual processing, which adds to the computational workload, although according to ESRI (2013b) the compressed data can be accessed directly. The UNION with compacted and compressed data shows a completely different picture: no difference between any of the variations. The scenarios with compression are not slower than the other scenarios, and no difference can be detected in the distribution of CPU user and system time. Possible reasons are that the datasets are small and have not been edited frequently, so the extra workload of uncompressing the data is small, although some difference would have been expected.

Figure 42: Impact of compacting and compressing on DISSOLVE (Win 7 fat client)


Figure 43: Influence of compacting and compressing on the UNION (Win 7 fat client)

Compacting shows no influence on the NEAR operation, as expected. The input datasets have not been updated frequently at Spatial Statistics; therefore compacting does not lead to a smaller, more efficiently stored dataset. Compressing could not be tested for the NEAR because the compression of one of the input datasets (the NWB) resulted in an error.


Influence of compacting and compressing on the NEAR, Win 7 fat client, ArcGIS 10.1 (in seconds):

                 NEAR compacted   NEAR default
execution time   14347            14455
CPU total        14347,45         14455,43
CPU user         14263,56         14372,26
CPU system       83,89            83,17

Figure 44: Impact of compacting on NEAR

6.6 Hardware configuration and ArcGIS Pro

It is difficult to compare the benchmark results of the workload in ArcGIS 10.1 on the normal fat client with the results of a higher version on a computer with better resources. Still, the results of the benchmark tests combining the big data configuration with ArcGIS 10.2.2 (instead of ArcGIS 10.1) show remarkable developments: no speed-up, even slightly slower results, for the UNION and the INTERSECT; only a slight speed-up of the DISSOLVE; but a speed-up factor of 69 and approx. 71 for execution time and total CPU time, respectively, for the NEAR. The values for CPU user and system time for the NEAR (Figure 45) also show that the CPU system time stays at the same level in both configurations, whereas the CPU user time shows a sharp decline. Correlated with the execution time and the redevelopment of the NEAR, this points at a smaller required calculation capacity, due to the improved, more efficient algorithm of the GENERATE NEAR TABLE and NEAR tools (ESRI, 2014d), which has also been referred to in section 2.1 (ESRI software development). The other geoprocessing tools did not show any improvement from the big data resources or the new software version: a slight decline for the INTERSECT and the UNION and a very slight improvement for the DISSOLVE.


Figure 45: Comparison of fat client (ArcGIS 10.1/Windows 7) and big data computer (ArcGIS 10.2/Windows 7) performance

The next step of the benchmark was the performance tests with the real data workload in ArcGIS Pro on the big data computer. Instead of showing separate results for ArcGIS Pro, the execution time results of all real data workload scenarios (without other factors) have been visualized in Figure 46. It shows different outcomes per tool: the UNION (using two smaller input datasets) and the INTERSECT show improvement from ArcInfo Workstation to ArcGIS 10.1 on the fat client, a slightly slower execution time on the big data computer with ArcGIS Desktop 10.2, and a faster time with ArcGIS Pro on the big data computer. The DISSOLVE clearly shows a considerable slowdown in all higher desktop configurations, surprisingly with ArcGIS Pro on the big data computer as the slowest configuration. The NEAR shows a big improvement from ArcGIS 10.1 on the fat clients to ArcGIS 10.2.2 on the big data computer, and again from ArcGIS 10.2.2 to ArcGIS Pro. The administrative processing tools show no improvement over ArcInfo Workstation, except for the SUMMARY STATISTICS. Especially the JOIN FIELD shows a dramatic decline in performance compared to ArcInfo Workstation, first from Workstation to Desktop on the fat client, but also during the subsequent steps on the big data computer; the execution time has been slowest in ArcGIS 10.2.2 on the big data computer. Given the performance requirement of Spatial Statistics that the execution time of the tools in Desktop should not be slower than in ArcInfo Workstation, we can only state that this requirement cannot be met in all cases. Some tools already perform better in ArcGIS Desktop 10.1, but do not show further improvement with more hardware resources (UNION, INTERSECT, SUMMARY STATISTICS and FREQUENCY even perform slower in ArcGIS 10.2.2). ArcGIS Pro shows improvement compared to ArcGIS 10.2.2 for the same tools, but no faster execution time than ArcGIS 10.1, except for the NEAR and, to a lesser degree, the INTERSECT and the UNION (for the UNION it could be negligible, given the small difference).


Figure 46: Execution time of geoprocessing tools in the configurations ArcInfo Workstation/fat client, ArcGIS 10.1/fat client, ArcGIS 10.2/big data and ArcGIS Pro/big data. Real datasets of different sizes are used.

6.7 Validation

6.7.1 Analysis of the performance measurements

The measurement results of the execution time in seconds have been very stable. At first, 5 runs per factor/tool combination have been executed; after discovering the stability of the measurements, the runs have been reduced to 2, especially for the scenarios with a long execution time. The measurements have shown stable results in almost all combinations. It is also remarkable that the first runs are not necessarily slower than the later runs. This is visible in the table accompanying the chart of Figure 47. The measurements for 100.000 and 50.000.000 records have been recorded 10 times.

Figure 48 also shows the consistency of the execution time and CPU values for geoprocessing tools run with ArcGIS Pro on the (Windows 7) big data computer. The chart also clearly shows low CPU system time values for the DISSOLVE.


Figure 47: Measurement values execution time DISSOLVE - synthetic data - ArcGIS 10.1 - Windows 7- fat client

Figure 48: Performance measurement values in ArcGIS Pro - Windows 7 big data computer


6.7.2 Analysis of the output data

Approximately 350 scenario runs have been executed in total on the 4 configurations, with all four configurations generating almost the same number of output files. Because of time constraints, these output files could not be checked on topology or on consistency of field values; therefore, only the number of records has been checked for each run. In most cases, the number of records of the output file was the same across runs. Only in 2 cases irregularities were noticed: with compressed input files (the UNION and the DISSOLVE) and with the INTERSECT in default settings. An error report has been sent for both cases (see Appendix 6: Error Report ESRI).

Compression in ArcGIS 10.1 on the fat client, Windows 7
After checking the number of records of the UNION output datasets, it appeared that the output datasets based on compressed input returned 37.278 records, whereas the default setting and the compacted datasets yielded 37.280 records.

The DISSOLVE scenario returned the following results for the compression only: the 4th run suddenly shows 13 records more. This is very curious because it is the same scenario: a DISSOLVE with a compressed input file on the configuration ArcGIS 10.1 on the fat client of Spatial Statistics (operating system Windows 7). It does not result in a higher execution time for the 4th run.

result2_0_3_2_3_0_1_1_0_1_0_1_1_loopnr_1   150074
result2_0_3_2_3_0_1_1_0_1_0_1_1_loopnr_2   150074
result2_0_3_2_3_0_1_1_0_1_0_1_1_loopnr_3   150074
result2_0_3_2_3_0_1_1_0_1_0_1_1_loopnr_4   150087
result2_0_3_2_3_0_1_1_0_1_0_1_1_loopnr_5   150074

Table 23: Results number of records for compression (DISSOLVE)

INTERSECT in default settings in ArcGIS on the fat client, Windows 7
After running the INTERSECT real data workload in its default settings, the log data revealed that two of the 5 runs resulted in a different number of records (39 records difference) and also in a higher execution time, which is a good warning to be careful with median values of the execution time. This pattern has not been recorded during the other INTERSECT workloads (the synthetic workload and the workload on the big data computer).

source_tbl_or_fc                            Number of records   Execution time
result4_0_3_2_3_0_1_0_0_1_0_0_1_loopnr_1    85488               298
result4_0_3_2_3_0_1_0_0_1_0_0_1_loopnr_2    85488               296
result4_0_3_2_3_0_1_0_0_1_0_0_1_loopnr_3    85527               362
result4_0_3_2_3_0_1_0_0_1_0_0_1_loopnr_4    85527               351
result4_0_3_2_3_0_1_0_0_1_0_0_1_loopnr_5    85488               295

Table 24: INTERSECT default - differences in number of records and execution time


6.8 Conclusions benchmark

After analysis, visualization and validation of the results, the following conclusions can be stated:

6.8.1 ArcGIS Desktop 10.1 shows improvement for most tools, but substantial performance degradation for JOIN and DISSOLVE

It is certainly not a "black or white" difference in performance, as perhaps perceived at Spatial Statistics, with good performance of Workstation versus bad performance of Desktop. UNION, INTERSECT, NEAR, SUMMARY STATISTICS and FREQUENCY even perform better in Desktop 10.1. However, the DISSOLVE and the JOIN show a considerable loss in performance, most probably because of the difference in data structure.

6.8.2 Limits of ArcInfo Workstation in scalability

Based on the benchmark results, ArcInfo Workstation cannot be viewed as the preferable environment: its performance is not better on all fronts, only for the DISSOLVE and the JOIN. Additionally, its scalability is very limited: the creation of input datasets larger than 1.000.000 records in Workstation often leads to a crash of the tool due to the dataset size. ArcInfo Workstation clearly shows better performance for the DISSOLVE with real data and with synthetic input datasets from 10.000 to 1.000.000 records.

6.8.3 Development of scalability with synthetic data linear, except for INTERSECT and UNION

The geoprocessing tools have been tested on scalability and algorithm behaviour with the use of simple synthetic data. While DISSOLVE and NEAR showed clearly linear results, the INTERSECT showed a faster execution time during the first step of increasing the input data, a linear curve up to the combination of the 150.000.100 record dataset with a 10.000.000 record dataset, and a sharp upward curve for the combination with 50.000.000 records as the second dataset. The values of the UNION tests remained almost stable. The reason for these results could be the large difference in number of records between the first and the second input dataset.

6.8.4 Exclusion of the noise factors network and workspace does not lead to performance improvement

No influence of network activity could be detected for access to a user profile, data or a script. Storage of in- and output data in the same or in different workspaces did not make a difference either.

6.8.5 Chosen optimization strategies in ArcGIS 10.1 do not show improvement

Sorting and indexing, as well as compacting and compressing, do not show any improvement in performance with the tested tools (UNION, INTERSECT, NEAR and DISSOLVE) in ArcGIS 10.1, compared to omitting these methods. Compression even shows a much slower execution time than the scenario without compression. The benchmark showed a negative impact of compressed data on performance and a small difference in output data.

6.8.6 Upgrading hardware does not provide satisfactory results

The benchmark results in Figure 46 show no improvement based on hardware (or hardware combined with ArcGIS 10.2.2). The tool for which the big data configuration showed improvement was the NEAR, which has been redesigned in ArcGIS 10.2.1 and therefore shows a large improvement in performance. Hardware upgrading could improve the performance of tools that are already "organized more efficiently".

6.8.7 Administrative processing results show that performance issues are not purely related to geoprocessing and spatial data structures

Whereas FREQUENCY and SUMMARY STATISTICS perform better in ArcGIS Desktop 10.1, a problem remains with the JOIN FIELD, which shows a significant deterioration in performance compared to the JOIN ITEM of ArcInfo Workstation, as indicated in Figure 46 in section 6.6. Workarounds using the ADD JOIN have been proposed by ESRI, but showed unstable results in ArcGIS Desktop and ArcGIS Pro. Surprisingly, there is no improvement of these tools in either big data configuration.


6.8.8 Validation of output data shows irregularities after compressing the data and during the INTERSECT (default)

Out of 5 test runs, the default scenario for the INTERSECT showed 2 runs with output datasets containing a higher number of records (both with the same number of records) and a correspondingly higher execution time. The compression of input datasets resulted in deviating output datasets for the UNION and the DISSOLVE. In both cases, it is worrying that these differences in output occur during runs of the same scenario, without any change of parameters. This certainly has to be researched further by ESRI and Statistics Netherlands.


Chapter 7: Lessons Learned From Similar Organizations

The goal of consulting other organizations is not to collect quantitative data on the use and performance of geoprocessing tools, but to provide some guidance and lessons learned to support the decision making process. Different channels have been used to get in contact with similar organizations: the LinkedIn page of the AGGN (Dutch ArcGIS user group), the ESRI ArcInfo Workstation Forum and the network of Spatial Statistics staff via INSPIRE working groups. Via the Workstation Forum, only a representative of the USGS reacted8; via the Spatial Statistics network, initially 6 organizations reacted. These 6 contacts received questionnaires, which finally resulted in 3 reactions. The amount of detail provided in the questionnaires varied per respondent; Statistics Portugal also provided additional documentation. Additionally, an interview has been held with the PBL (Netherlands Environmental Assessment Agency), which allowed for more interaction than with the other organizations, which received questionnaires.

The domains of the statistical agencies are of course different from those of the USGS and the PBL: the core business of the USGS is scientific research on natural resources and climate, whereas the PBL deals with policy research in the fields of environment, nature and spatial planning. In these organizations, geo-information is part of the core business. This is also reflected in the number of GIS users, shown in Table 25. The position of geo-information also varies per organization: the respondents of Statistics Portugal, Statistics Italy and the PBL work for a separate organizational unit that deals with geo-information, whereas Spatial Statistics is one of the statistics producing teams within the sector Environmental, Energy and Spatial Statistics.

As private sector companies are often involved in supporting migration projects, an interview has been held with Aris (Desabandu et al., 2014), a company that supports public agencies, research institutions, municipalities and non-profit organizations with the use of ESRI, Oracle and open source products. Aris has experience with supporting migrations from ArcInfo Workstation to ArcGIS Desktop and has supported the PBL in its migration. The results of this interview will be included in the different sections of this chapter, although, unlike for the other organizations, not in the tables.

The results of the interview and the questionnaires (for Statistics Portugal, a presentation for INSPIRE has been added as well) are presented in the following sections, covering the GIS users, geoprocessing activities, performance optimization, ICT infrastructure, migration experience and lessons learned. The results of the questionnaires have also been stored in tables, available in Appendix 5. In each table, the information is presented for Statistics Netherlands as well (except for the sections “Migration” and “Lessons learned”), to provide a quick way to compare its situation with that of the respondents.

7.1 GIS users

The goal of this section is to get an impression of the number of users and of the currently used GIS software. Light users occasionally perform simple visualization, editing and simple analysis, whereas heavy users regularly apply heavy analysis (for example a series of tools via ModelBuilder or scripting). Table 25 in Appendix 5 shows that only Statistics Netherlands and Statistics Portugal have approximately the same number of heavy and light users. The total number of GIS users at the USGS is not known, but the percentage of heavy users is much higher than that of the light users, contrary to the other listed organizations. Regarding domain and scale, the USGS is least comparable with the other organizations.

Surprisingly, only Statistics Portugal has a spatial database (Oracle), although its number of GIS users is not that high. They have combined it with ArcGIS version 10.2.2. The Netherlands Environmental Assessment Agency used to have Oracle but discarded it because tuning and maintaining the database required too much knowledge and workforce, and the environment was less flexible. This may have been one of the reasons that the use of the database did not lead to better performance. On the other hand,

8 Of the USGS South Dakota Water Science Center in Rapid City

Statistics Portugal introduced ArcSDE around 2003-2005 out of a need for better data management and data dissemination. Overall, the use of open source software or other proprietary GIS software is very limited.

7.2 Geoprocessing and performance optimization

This section covers the geoprocessing workload of the organizations and their current performance optimization methods. The type of analysis that Statistics Netherlands is doing has most in common with Statistics Italy and Statistics Portugal, which is not surprising, given the fact that they are the same type of organization. Spatial Statistics shows a higher variety of products, but the amount of information available on its products is also the highest. The PBL works with environmental models that also include spatio-temporal data, which leads to a higher data volume. At the USGS (Water Science Center), mostly raster analysis is used and datasets are prepared for hydrographic analysis.

The geoprocessing tools are mostly executed by script and/or ModelBuilder. Partitioning and tiling are mostly used to improve performance. The Netherlands Environmental Assessment Agency also uses multiprocessing (via scripting), although it has been judged useful only for CPU-bound tools and for calculations where the total execution time (including pre- and post-processing) is still faster than in linear mode; a sketch of this approach follows below. The ArcGIS servers in the data centre of the PBL can be used as a grid for multiprocessing. Optimization by adaptation of the index or sorting, or the application of compression or compaction, is not mentioned, although these were also not mentioned as examples in the questionnaire. Only at the PBL and the USGS is there much time available to experiment with performance optimization.
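To illustrate the partitioning-plus-multiprocessing approach described above, the following is a minimal sketch in Python. It assumes a file geodatabase that has already been partitioned into feature classes; the paths and the choice of BUFFER as the CPU-bound tool are hypothetical. Each worker process imports arcpy separately, so that every process gets its own ArcGIS session.

import multiprocessing
import time

def buffer_partition(fc):
    # Each worker needs its own arcpy import (one ArcGIS session per process).
    import arcpy
    out_fc = fc + "_buf"
    arcpy.Buffer_analysis(fc, out_fc, "100 Meters")
    return out_fc

if __name__ == "__main__":
    # Hypothetical partitions, e.g. one feature class per region.
    partitions = [r"C:\data\benchmark.gdb\nl_part_%d" % i for i in range(4)]
    start = time.time()
    pool = multiprocessing.Pool(processes=4)   # one process per partition
    results = pool.map(buffer_partition, partitions)
    pool.close()
    pool.join()
    print("parallel run finished in %.1f s" % (time.time() - start))

As noted in the questionnaire results, such a setup only pays off when the total execution time, including splitting the data beforehand and merging the outputs afterwards, stays below the linear run time.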

7.3 ICT Infrastructure

The questions within this section were asked to find out whether the needs of the GIS users are met by the ICT departments of these organizations and how these needs are communicated. The communication of the ICT needs of the GIS users seems to be organized very differently at the listed organizations, although the level of detail in the provided answers varies strongly per respondent. It also seems that organizations such as the Netherlands Environmental Assessment Agency and the USGS, with a large GIS user base and a more research-related environment, have more ICT support and facilities at their disposal. The use of virtual desktops at the Netherlands Environmental Assessment Agency is very interesting, since virtual desktops have not worked well for Spatial Statistics. Statistics Italy and Statistics Portugal use local processing, like Statistics Netherlands.

7.4 Migration

The other organizations migrated in the early 2000s to a higher Desktop version (or ArcSDE). Although some ArcInfo Workstation or ArcView users remain, they are not responsible for complete production processes, as is the case at Statistics Netherlands. Of course, it cannot be ruled out that the other organizations migrated with a “smaller” workload, for example smaller datasets and different calculations. The Environmental Assessment Agency has been supported by the company Aris during its migration; others, like Statistics Portugal and Statistics Italy, have initially been supported by ESRI. Aris has described different motivations to migrate from AML (the programming language of ArcInfo Workstation) to Python (with ArcPy as library for ArcGIS Desktop):

 The use of low-level programmed tooling based on ArcObjects (development environment using Visual Basic for Applications, Java or C#), which allows for more custom-made tooling and offers a possibility to improve performance
 Dealing with the data more efficiently
 More advanced hardware resources, but not as a sustainable solution

Aris has experienced during migration projects that some tools are faster in AML, but the overall performance of the new environment has not been slower. Moreover, correct output of the geoprocessing tools is most important (Desabandu et al., 2014).


The experiences with the performance of the new software have mostly been neither completely positive nor completely negative: the Environmental Assessment Agency discarded a number of models because of instability of ArcPy; other experiences ranged from mixed to positive.

7.5 Lessons learned

A number of aspects have to be kept in mind when reading the questionnaire and interview results: the level of detail in the answers and the English writing skills differ strongly per completed form. It is not known whether the questions would have been answered differently had they been formulated differently. After analysis of the provided information it can be concluded that, regarding users, workload and facilities, the situation at Statistics Netherlands can largely be compared with the situation at Statistics Italy and Statistics Portugal: they conduct similar analyses and are similar in number of users. As at Statistics Netherlands, time to work on performance optimization is very limited. The questionnaire contained a question on lessons learned, but the lessons learned stated below are the result of the total content of the questionnaires:

1. The transition has been a gradual process for the respondent organizations. The respondent of the USGS called the migration a gradual process instead of a migration project, but the answers of the other organizations also show that a few ArcInfo Workstation users still remain, even after such a long time (app. 10-15 years). An important difference with the situation at Statistics Netherlands is that they started to migrate a long time ago, when ArcInfo Workstation was not at the end of its product cycle.
2. The influence on ICT infrastructure decision making in terms of facilities for geoprocessing will depend on the user base and the organization of the GIS users: are they represented in a central, supporting geo-information unit, or does a unit from “the business”, like Spatial Statistics, represent their needs to a certain degree?
3. The PBL shows interesting options in optimization, but also has a more flexible (geo) ICT infrastructure.
4. Open source or other proprietary software is largely new territory and has not really been investigated on functionality and performance.
5. For most of the statistical bureaus and the PBL, some part of the migration has been outsourced, at least the initial phase.
6. The evaluation of the migration to a higher ArcGIS version is mixed and varies per tool or model. It also depends on the user needs of the organization: does performance have a high priority, or rather maintainability and the organization of the data?



Chapter 8: Trends in geoprocessing

During the past years, “big data” have emerged as a consequence of new data sources. The technologies described in this chapter have been researched for a number of years already, but are not yet regularly applied, for a number of reasons, for example lack of knowledge. Often, the solution is a combination of technologies: distributed storage via grid computing, for example, is not possible without an efficient organization of the data with the use of, for example, space filling curves. Grid computing is used to facilitate parallel processing.

The following sections will describe a number of technological geoprocessing trends: section 8.1 will describe in short new forms of data collection that lead to new data formats, whereas 8.2 will deal with cloud services. Section 8.3 will describe the MapReduce principle and 8.4 will cover the application of GPU processing. In section 8.5, these trends will be discussed regarding their applicability for Statistics Netherlands.

8.1 New forms of data collection and data formats

New ways of capturing spatial data push the need for new ways to store and disseminate those data. Lidar (light detection and ranging), for example, is a remote-sensing technique that “uses laser light to densely sample the surface of the earth, producing highly accurate x,y,z measurements” (ESRI, 2014e). The Lidar measurements are stored in a massive point cloud data format, storing x, y, z and possibly other attributes. Point cloud data are stored in file-based systems or relational database systems (van Oosterom et al., 2015). A well-known file-based public point cloud data format is the binary format LAS. It is mostly used for Lidar point cloud data, but can also be applied for the exchange of other 3-dimensional data. It also maintains the Lidar-specific information of the data (ASPRS, 2012).

The LAS format also has a compressed version, called LAZ (van Oosterom et al., 2015). ESRI has developed a compressed point cloud format as well, the ZLAS format, which is usable with ESRI GIS software (van Oosterom et al., 2015). Possibilities for a Point Cloud Spatial Database Management System are being researched (van Oosterom et al., 2015). At the moment, there is no explicitly stated need for the application of point cloud data, but this might change in the future.
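As a small illustration of how file-based point cloud formats work, the sketch below reads the first bytes of a LAS file header in Python and checks the “LASF” signature and the version bytes defined by the ASPRS specification; the file path is hypothetical.

import struct

def read_las_version(path):
    # The ASPRS LAS header starts with the 4-byte signature "LASF";
    # bytes 24 and 25 hold the major and minor version numbers.
    with open(path, "rb") as f:
        header = f.read(26)
    if header[0:4] != b"LASF":
        raise ValueError("not a LAS file")
    major, minor = struct.unpack("<2B", header[24:26])
    return "LAS %d.%d" % (major, minor)

print(read_las_version(r"C:\data\sample.las"))  # hypothetical path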

8.2 Cloud services

Providing on-premise hardware facilities for big data geoprocessing can lead to very high costs. Therefore, more organizations have started to research the possibilities of cloud solutions. Cloud environments offer services for data storage, applications or infrastructure (Yue, 2012). There are different cloud environments, such as Google App Engine (GAE), Microsoft Azure or Amazon Elastic Compute Cloud (EC2). There are different forms of cloud computing services (Kouyoumjian, 2010):

 Software as a Service (SaaS): this service offers end-user applications, for example an off-premises asset management system.
 Platform as a Service (PaaS): this service does not provide a ready-to-use application but an application platform or middleware as a service to create custom applications. The Windows Azure platform of Microsoft is an example of a PaaS.
 Infrastructure as a Service (IaaS): this service delivers an off-premises infrastructure of storage, operating systems or computing power. Amazon Elastic Compute Cloud (EC2), on which for example ArcGIS Server can be deployed, is an example of such a service.


Besides the apparent advantages of cloud environments, a number of challenges can also be noted:

 Information security: Statistics Netherlands often uses microdatasets that contain privacy-sensitive information. Therefore, the ICT infrastructure of Statistics Netherlands is almost completely closed (see also chapter 3).
 Interoperability, because of the diversity of data formats and geoprocessing platforms (Yue et al., 2012); adhering to OGC standards can mitigate this.

8.3 Hadoop and MapReduce

Hadoop is an open source implementation of Google’s MapReduce framework, built to store and process large amounts of data in parallel via a distributed infrastructure of computer clusters (Hoel & Park, 2013). The raw data on Hadoop is not yet suitable for geoprocessing; structuring the data is therefore essential. Structuring of data in Hadoop employs 2 main techniques: sorting the data and splitting it into shards. These shards store data that is spatially related on the same machine (Hoel & Park, 2013).
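The MapReduce principle itself can be illustrated with a minimal, self-contained sketch in plain Python (not tied to any Hadoop API): the map step emits a key-value pair per record (here a 100 m grid-cell key per point, with made-up coordinates), and the reduce step aggregates all values per key, which is exactly the work that Hadoop distributes over the shards.

import collections

def map_point(point):
    # Map step: emit (grid cell, 1) for each point; 100 m cells.
    x, y = point
    return (int(x // 100), int(y // 100)), 1

def reduce_counts(pairs):
    # Reduce step: sum the emitted values per key (after the shuffle/sort).
    counts = collections.defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

points = [(155000.0, 463000.0), (155050.0, 463020.0), (156200.0, 463900.0)]
print(reduce_counts(map_point(p) for p in points))
# {(1550, 4630): 2, (1562, 4639): 1}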

Figure 49: MapReduce principle (Janakiram, 2012)

More recently, ESRI has also implemented GIS tools for Hadoop (http://Esri.github.io/gis-tools-for-hadoop/). The Port of Rotterdam Authority has used the combination of Hadoop and ArcGIS to map ship movements over a certain period (Desabandu & Eijkelenboom, 2014). Even if the solutions discussed here lead to considerable performance improvement, the cost of implementing them will be very high: it requires resources to build knowledge of these methods.

8.4 GPU enabled processing: CUDA

Graphical Processing Units (GPUs) have emerged as a comparatively affordable technology for “general purpose computing”9, while other high performance computing resources (for example parallel computing resources) have been available only to select research groups (Zhang & You, 2012). For optimal use of GPU technology, data structures and algorithms are needed that are suitable for parallelization (Zhang & You, 2012). Many GPUs are enabled with CUDA, a parallel computing platform and programming model (NVIDIA, 2015). Some proprietary GIS software applies GPU processing, for example Manifold (Manifold). ESRI seems to have implemented GPU computing primarily for visualization (Desabandu & Eijkelenboom, 2014), although it was already tested in 2010 (ESRI, 2010a).
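To give an impression of what such parallelization looks like, the following is a minimal sketch of a CUDA kernel written in Python with the Numba package (an assumption for illustration only: Numba was not part of this research, and the kernel requires an NVIDIA GPU). Each GPU thread computes the distance of one input point to a fixed reference point, a toy stand-in for a NEAR-style computation.

import math
import numpy as np
from numba import cuda

@cuda.jit
def dist_kernel(xs, ys, ref_x, ref_y, out):
    i = cuda.grid(1)  # global thread index: one thread per point
    if i < xs.size:
        out[i] = math.sqrt((xs[i] - ref_x) ** 2 + (ys[i] - ref_y) ** 2)

n = 1000000
xs = (np.random.rand(n) * 280000.0).astype(np.float32)  # pseudo RD New x
ys = (np.random.rand(n) * 625000.0).astype(np.float32)  # pseudo RD New y
out = np.zeros(n, dtype=np.float32)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
dist_kernel[blocks, threads_per_block](xs, ys, 155000.0, 463000.0, out)
print(out[:5])

The per-point independence of this calculation is what makes it suitable for the GPU; tools like DISSOLVE, which depend on topological relations between records, are much harder to parallelize this way.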

9 General Purpose computing on Graphics Processing Units (GPGPU)

8.5 Summary: consequences for Statistics Netherlands

At the moment, Statistics Netherlands uses large datasets, but in “traditional” 2-dimensional form. In the future, there will probably be less need for 3-dimensional data such as point cloud data, but more for spatio-temporal data, which can be time sequences of a certain area, or telephone or sensor data with a location. Team Methodology is already experimenting with R Spatial to analyse and visualize traffic sensor data. For implementation of cloud services at Statistics Netherlands, only geoprocessing for open data would be viable. Furthermore, additional research would be needed to assess which cloud configuration would be most suitable. Such an option could be investigated with the help of the innovation lab. It is important, however, to have an overview of the total costs that would be involved in setting up the infrastructure. The Hadoop/MapReduce framework as well as GPU-enabled processing need efficient structuring of the data and knowledge of those frameworks. It is difficult to assess whether the cost of investing in data preparation and knowledge building would be outweighed by the improvement in performance. Even if the described trends are not implementable in the short term, future developments should be tracked by Spatial Statistics in cooperation with related teams, and proofs of concept could be organized to assess their usefulness for the organization.



Chapter 9: Discussion, conclusions and recommendations

This chapter starts with a discussion of the research process (9.1), followed by the answers to the subquestions. These findings will be subject to discussion before answering the central research question: Which alternatives to the current geo-ict infrastructure can be proposed for Statistics Netherlands that meet performance requirements of its geoprocessing activities and are suitable for implementation within the organizational constraints of Statistics Netherlands? The final conclusion will be followed by recommendations (9.2).

9.1 Discussion and conclusions

This research project has provided an interesting view on geo-information processing at a governmental agency. The production of these data is initiated by law, governmental policy and societal needs, and therefore has to meet high quality standards. Production and quality requirements thus play an important role. They leave no room to experiment in the production environment, which means that more resources have to be reserved for research on new technologies to produce current products more efficiently and to produce new products. Many governmental agencies therefore move towards an up-to-date (geo) ICT infrastructure at a slower pace. At the same time, the amount of data is growing fast, and the academic world and private sector are working on solutions for this growth. Methods to deal with large amounts of data, like MapReduce and grid computing, are important developments but require a logical organization of the data and high investments in expertise and hardware. The same can be said for new data formats that have evolved out of new data capture methods.

For the research, a broad approach has been chosen, taking into account the technical aspects of geoprocessing performance but also the organizational aspects that come into play when changes in the technical infrastructure are needed. These organizational aspects have been covered in (parts of) chapters 3 and 5. The information has been gathered through interviews with representatives of the Spatial Statistics staff, a staff member of the IT department and the head of the innovation laboratory. They have all been essential to answer the first research question: Which technical as well as organizational bottlenecks can be identified during geo-information processes at Statistics Netherlands and how are they currently dealt with?

Still, a more complete analysis could have been made with a GIS user from another department, e.g. team Process Development and Methodology, which uses R Spatial, or another ArcGIS user. Now it is not so clear what the needs of users outside the team Spatial Statistics could be. Furthermore, the technical body of work included the literature research on benchmarking, the benchmark development, the benchmark execution and the interpretation of the results. The benchmark could not be exhaustive in its choice of optimization possibilities, metrics and validation. The recommendations by ESRI (Iparraguirre, 2014), which were directed at improving the fat client infrastructure at Spatial Statistics, have not yet been implemented within the infrastructure of Spatial Statistics. The response of the contacted organizations for research question 6 has been relatively low, although valuable information was provided. It is possible to deduce some lessons learned from the given answers. Best practices are harder to discover because the organizations vary in a number of aspects: size, products, number of GIS users and much more. A lot of variation could be observed in the level of detail of the answers in the questionnaires.

This research has been new in different ways: proprietary GIS software has been tested within a real life situation, using both real life and synthetic data. Additionally, state-of-the-art computer resources and the newest ArcGIS version, ArcGIS Pro, have been tested. This technical aspect of the performance benchmark has been combined with documentation and interview reports that mainly deal with the organizational aspect of performance optimization. This is necessary to be able to reach the goal stated in the title of the thesis: “Towards a sustainable geoprocessing environment at Statistics Netherlands through performance benchmarking”. Which sustainable alternative can be advised? First, all research questions will be answered in subsections 9.1.1 through 9.1.6.


9.1.1 Which technical as well as organizational bottlenecks can be identified during geo-information processes at Statistics Netherlands and how are they currently dealt with?

Diversity of users, applications and performance needs

Researching technical documentation and speaking to the members of team Spatial Statistics revealed different types of users within and outside the team Spatial Statistics. Spatial Statistics has a small number of heavy users, skilled in ArcInfo Workstation and AML and partially in ArcGIS Desktop and Python. Parallel to these users, their colleagues at the department of Process Development and Methodology use R Spatial for statistical analysis of spatial data. There is no real exchange of knowledge between Spatial Statistics and Process Development and Methodology. The light users apply different software via the virtual desktop, such as ArcView, ArcGIS Desktop and PX Map. This situation makes it difficult to support and optimize the use of spatial data from the point of view of the users and the IT department.

User experiences with ArcGIS Desktop

A number of users have experienced instabilities with the application, for example crashing of the application after running a tool (no specific tool). The role of the user profile is not really clear: creating a new user folder under %appdata% helped. For some users, the desktop application showed a much slower execution time for some tools, e.g. the (administrative) JOIN, but also UNION. A lot of methods to optimize performance and forestall instabilities have been applied: dataset partitioning (on attribute or record level), local processing and maintaining the AML/Workstation tools, but there is no sustainable strategy available.

Lack of means of IT department and Spatial Statistics for application management

There are only 2 staff members of the IT department who perform tasks such as licensing and troubleshooting for the GIS users at Spatial Statistics, or even for all of Statistics Netherlands. At the same time, the production processes and the available resources at the spatial team leave very limited time to think fundamentally about non-functional requirements such as performance, to set up SLAs (Service Level Agreements) and to look for means to fulfil the non-functional requirements together with the IT department.

9.1.2 Which (technical) performance factors can be identified for geoprocessing based on literature examples and on experiences at Statistics Netherlands, and how can these factors be prioritized and evaluated with the design of a benchmark, based on the performance requirements of Spatial Statistics?

Research to answer this question showed that the foremost performance requirements are: reliable results, stability and execution times that are not slower than the current ones in ArcInfo Workstation. Related work has shown that performance benchmarks for geoprocessing environments have been developing slowly, and are often situated within an academic environment with highly configurable test beds. The software used in these benchmarks has mostly consisted of spatial databases. This benchmark is situated within a production environment, using only one software vendor and file-based systems. This automatically reduces the number of benchmark factors and parameters. It is therefore important to grasp that the benchmark only provides information on performance in ArcGIS Desktop 10.1, 10.2 and ArcGIS Pro, which are file-based systems. The first steps of the benchmark are baseline tests, which use the settings in ArcInfo Workstation 10 and ArcGIS Desktop 10.1, and scalability tests with synthetic data to assess the efficiency of the algorithms. Based on the literature study, experiences at Statistics Netherlands and interviews with ESRI, a number of factors have been identified and categorized in three groups within the benchmark:

 Factors where the role of network or workspace interference is assessed
 Factors that assess the role of performance optimization in ArcGIS 10.1: spatial access methods like sorting and spatial indexing, compacting and compressing
 Factors that assess the role of hardware configuration and a new 64-bit desktop system, ArcGIS Pro


9.1.3 What are the results and conclusions of the performance benchmark tests?

The benchmark results indicate that long term investment in application and algorithm design yields the best results in performance. The leading example for this conclusion is the (in ArcGIS 10.2.1) improved NEAR algorithm, which resulted in a substantial improvement of performance, whereas DISSOLVE could never reach the performance level of ArcInfo Workstation, most probably because of the controlled topological data structure of the coverage.

9.1.4 What are geo-ict related trends (e.g. role of performance in following releases of ArcGIS or other proprietary or open source software) that could influence the recommended migration scenario for Statistics Netherlands?

The presented trends are: cloud computing, Hadoop/MapReduce and GPU-driven geoprocessing. These trends will probably not influence the migration scenario or the production environment in the short term. It is more realistic to apply the presented technologies in innovative projects in cooperation with the innovation laboratory, such as, for example, a cloud solution that uses only open data.

9.1.5 What are lessons learned from organizations with a profile similar to Statistics Netherlands that process large spatial datasets and can provide information on performance evaluation and its role within decision making related to geo-ict infrastructure?

The situation at Statistics Netherlands can be compared with the situation at Statistics Italy and Statistics Portugal regarding types of analyses and number of users. Time to work on performance optimization is very limited. Yet, Spatial Statistics has a position within the business of Statistics Netherlands and is mostly a production unit and only to a small extent a supporting unit on geo-information. The size and position of Spatial Statistics make it difficult to acquire suitable resources and to experiment with methods to optimize performance. The lessons learned obtained from the interview and questionnaires can be formulated in the following points:

1. The transition has been a gradual process for the respondent organizations, but for Statistics Netherlands the urgency to migrate is higher because of the high production load in AML and the product cycle of ArcInfo Workstation.
2. The influence on ICT infrastructure decision making in terms of facilities for geoprocessing will depend on the users of geo-information and their organization: are they represented in a central, supporting geo-information unit, or does a unit from “the business” represent their needs?
3. The PBL shows interesting options in optimization, but also has a more flexible (geo) ICT infrastructure.
4. Open source or other proprietary software is largely new territory and has not really been investigated on functionality and performance.
5. For most of the statistical agencies and the PBL, some part of the migration has been outsourced, at least the initial phase.
6. The evaluation of the migration to a higher ArcGIS version is mixed and varies per tool or model. It also depends on the user needs of the organization.

9.1.6 Which (geo) ICT scenario is most suitable for the organization, including implementation steps and quick wins?

Deciding on a migration scenario for Spatial Statistics is difficult because of its high impact on working processes and resources, all the more because Spatial Statistics has built up so much expertise and so many products in ArcInfo Workstation and AML and has been innovative in developing these products at an earlier stage. Without the expertise in ArcInfo Workstation, the current number of products would probably not be that high. The success of these products has also led to a higher number of processes, developed in AML scripts, that have to be reengineered. Continuing with AML/Workstation would be the easiest option in the short term, but it is not sustainable: support stops at the end of this year, and ArcGIS Pro, which will be the successor of the Desktop product line, will not support the use of coverage tools. Partitioning of datasets will continue to be needed because of the size limits of the coverage.


At the same time, an interesting development is taking place at departments like the methodology department and the innovation laboratory: they are starting to use large spatial datasets and open source software. This leads to scattered geo-ICT knowledge and production processes at Statistics Netherlands. Therefore, it would be advisable to cooperate on projects involving spatial data and to invest time in exchanging expertise. Staff members who are very knowledgeable in AML have to reach the same standard in Python. This will also cost time, although some support from other teams with Python experts who are not experienced with Geographic Information Systems is being considered. Therefore, some training in GIS will be necessary for the Python experts, and further training in Python will be needed for the Spatial Statistics staff members. Spatial Statistics has already started to provide Python training to a number of staff members.

The benchmark results indicate that long term investment in application and algorithm design as well as efficient data structures yields the best results in performance. The leading examples for this conclusion are the (in ArcGIS 10.2.1) improved NEAR algorithm and the weak performance of the DISSOLVE in Desktop versus ArcInfo Workstation. This also indicates that other software solutions using different algorithms or a data structure that organizes data more efficiently would be interesting to investigate further.

However, the goal is also to achieve a sustainable geo-ict infrastructure. The term “sustainable” is especially important because it entails more than satisfactory geoprocessing performance: it also covers system stability and reliable results. Additionally, with the fast developments in GIS technology that push the product cycles of software developers like Esri, it is legitimate to state that a currently sustainable solution will no longer be sustainable in approximately 5-8 years. Therefore, the next migration step should be the first one of a development that is supported by the four “wares” of information systems:

 Software: reengineering of current AML scripts, acquisition of software that adheres to functional and non-functional requirements
 Hardware: suitable hardware and network bandwidth
 Humanware: knowledge and skills in new technology, application management
 Orgware: reengineering of processes to meet production deadlines

These findings lead towards an answer for the main research question:

Which alternatives to the current geo-ict infrastructure can be proposed for Statistics Netherlands that meet performance requirements of its geoprocessing activities and are suitable for implementation within the organizational constraints of Statistics Netherlands?

There is not one correct solution; different alternatives can be proposed. However, considering the previous thought on the lifespan of sustainability, the first alternative should be considered:

Alternative 1: Keep small scale infrastructure within ArcGIS Desktop

Spatial Statistics stays within the ArcGIS Desktop suite but should consider a quicker migration to 10.2, or rather 10.3, because more performance problems are resolved in these versions. According to the benchmark results, at least the NEAR has been improved in ArcGIS Desktop 10.2.1, and according to ESRI, several bottlenecks have been addressed in 10.2 and 10.3. Additionally, performance problems will mostly be resolved just before system upgrades. An advantage of 10.3 is the availability of ArcGIS Pro, which should lead to further testing of that application. The possibilities to optimize ArcGIS Desktop 10.1 remain very limited. This means that Spatial Statistics will need less in-house expertise on performance optimization, but will remain dependent on ESRI product development processes. This dependency could be countered by cooperating more closely with other ArcGIS users, nationally and internationally, to press for the needed improvements.

A consequence of this option would be the redesign of AML scripts to Python 2.7.x and 3.4 (the version that is supported by ArcGIS Pro), possibly also a change in methodology per script. Several disadvantages are part of


this solution: some tools have improved in performance, e.g. the NEAR, but others, like the DISSOLVE or the JOIN FIELD, would still be a problem for performance. It would be advisable to assess per script which tools could form a bottleneck and whether they could be replaced or executed within a different environment, or whether the organization or the production process has to be changed. For example, planning certain processes earlier could be advisable if the performance problems remain unsolvable within the Desktop environment. For the long term, upscaling of the geo-ICT infrastructure is recommended, based on the current situation of Statistics Netherlands and the lessons learned from other organizations:

Alternative 2: Invest in an up-to-date infrastructure with other departments

Being more in control of performance optimization means that a more configurable geo-ICT infrastructure is needed. Spatial Statistics is too small to make such an investment in infrastructure and the needed expertise (“humanware”). Therefore it should cooperate with other departments at Statistics Netherlands that have a need for spatial data analysis. Such a (geo) ICT environment could include several components: a spatial database such as Oracle combined with ArcSDE, an open source spatial database such as PostGIS, or a file-based environment. Additionally, an extension of the benchmark towards spatial databases would be needed. This need is especially apparent if the volume of data is expected to grow in the future. This solution would require even more resources and also a different embedding of geo-information within the organization. A possibility could be the formation of a “virtual” geo-information team. Such a step requires additional research and the start-up of proofs of concept (POCs). Therefore, the third alternative should be used as a complementary step.

Alternative 3: Organize innovative projects with internal and external partners

In addition to the more short-term solution, Spatial Statistics should start a number of projects dedicated to the application of new geo-ict technologies, for example a project involving Spatial Statistics and other teams in cooperation with the innovation laboratory. It is recommended to use this alternative to prepare the step towards Alternative 2, but also as a complementary step to Alternative 1.

The recommendations have been divided into recommendations to improve the performance benchmark and recommendations for further steps that Statistics Netherlands could take.

9.2 Recommendations

9.2.1 Further migration steps

Establish a geo-information vision for the long term

Without a perspective on the goals that have to be achieved, it is impossible to define what is necessary to achieve them. For the long (or even mid-) term, Statistics Netherlands has to form a common vision and goal regarding the role of geo-information. This process has already started, partially because of influence from the director general of Statistics Netherlands, who has asked all teams to formulate the purposes that the team serves. It is clear that geo-information is used in more teams than Spatial Statistics. At the moment, this happens independently in each team. By operating from a common vision and being open to the information needs of other teams and departments, more impact and influence can be exerted within the organization. Important partners could be the methodology and innovation departments, but also other teams, like real estate or environment.

Collect information on the use of R Spatial

The use of R Spatial is out of scope for this research, but should be part of establishing the long term vision for the use of geo-information. There is no information available on products made with this software, even less on the types of analysis and their performance.

Try to cooperate with other ArcGIS user organizations to lobby for improvements of certain tools


The larger the base of users that demands the redesign of tools that do not perform well, such as the JOIN and the DISSOLVE, the higher the chance that ESRI will put more effort into performance improvement of these tools. This has been proven by the redesign of the NEAR and a number of raster tools for ArcGIS Pro.

Define Service Level Agreements (SLAs)

Currently, the non-functional requirements are not explicitly stated: the performance requirements have been limited to the current performance with ArcInfo Workstation. Defining SLAs will aid communication with the IT department.

9.2.2 Steps to extend the benchmark

Repeat synthetic data “stress tests” on the big data computer

The boundaries of the prioritized tools have been tested within the fat client environment. Due to the limited available time, the synthetic data test series have not been run on the big data computer; they should be repeated there and, if possible, extended, for example with a DISSOLVE on a synthetic dataset of up to 100.000.000 records. A sketch of such a test is given below.
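A minimal sketch of such a stress test in Python, under the assumption that arcpy is available on the big data computer; the geodatabase path, feature class name and record count are hypothetical, and the DISSOLVE is run on a synthetic point dataset with a single class attribute.

import random
import time
import arcpy

arcpy.env.overwriteOutput = True
gdb = r"C:\data\synthetic.gdb"          # hypothetical file geodatabase
fc = gdb + "\\points_synthetic"

# Generate a synthetic point dataset with one attribute to dissolve on.
arcpy.CreateFeatureclass_management(gdb, "points_synthetic", "POINT")
arcpy.AddField_management(fc, "CLASS", "SHORT")
with arcpy.da.InsertCursor(fc, ["SHAPE@XY", "CLASS"]) as cursor:
    for i in range(1000000):            # scale up towards 100.000.000 on the big data computer
        x = random.uniform(0.0, 280000.0)
        y = random.uniform(300000.0, 625000.0)
        cursor.insertRow(((x, y), i % 10))

# Time the DISSOLVE as a simple stress test.
start = time.time()
arcpy.Dissolve_management(fc, gdb + "\\points_dissolved", "CLASS")
print("DISSOLVE took %.1f s" % (time.time() - start))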

Repeat real life data workload on a fat client with ArcGIS Pro

The impact of ArcGIS Pro could not be clearly evaluated, because it has been tested within a different ICT infrastructure, not connected to the Statistics Netherlands network and with better hardware resources available, such as an SSD disk, more memory and more CPU capacity. In the third quarter of 2015, ArcGIS Pro will probably be installed on a fat client at Spatial Statistics. Then it will be possible to judge whether the further improvement of the NEAR has occurred due to the hardware capacity or due to further optimization of the NEAR algorithm in ArcGIS Pro.

Repeat workload scenarios at the Netherlands Environmental Assessment Agency

Statistics Netherlands regularly communicates with the Environmental Assessment Agency on data deliveries or exchanges of experiences. With a much larger group of GIS users and its more research-driven working processes, the agency has been able to experiment more with performance optimization and to receive more resources for its ICT infrastructure. Conducting geoprocessing via a virtual desktop has been implemented successfully there, although ArcGIS Desktop 10.1 is used. It would be very interesting to run (part of) the benchmark at the Environmental Assessment Agency and to compare the results.

Repeat INTERSECT default workload on the fly

The INTERSECT workload using the default settings has resulted in different output tables: out of the five runs (executed via script), two show a difference in the number of records. A possible reason could be a memory problem due to the use of iterations. Therefore, it is recommended to repeat the tests with five runs, but not within a script: on the fly via ArcMap. If this still leads to the same results, ESRI should conduct further research into the matter. A sketch of the scripted validation is given below.
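For reference, a minimal sketch of how such a scripted validation could compare record counts across repeated runs (the input feature classes are hypothetical; arcpy is assumed):

import arcpy

fc_a = r"C:\data\test.gdb\parcels"      # hypothetical inputs
fc_b = r"C:\data\test.gdb\districts"

counts = []
for run in range(5):
    out_fc = r"in_memory\intersect_run%d" % run
    arcpy.Intersect_analysis([fc_a, fc_b], out_fc)
    counts.append(int(arcpy.GetCount_management(out_fc).getOutput(0)))

print(counts)
# All five counts should be identical; any deviation reproduces the
# irregularity observed in the benchmark and should be reported to ESRI.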

9.2.3 Scientific outlook

Some of the ideas for further research could be an extension of the current benchmark, but require more than just minor adaptations, especially if different GIS applications or different types of data are involved. Not only the technical but also the organizational perspective could provide opportunities to set up a follow-up research project, for example on the successful reengineering and implementation of information processes.

Extend the performance benchmark by comparing different file-based proprietary and open GIS

Would it be possible to execute a workload similar to this benchmark in other proprietary (e.g. Manifold, GeoMedia) or open (QGIS) software? Combined with a survey or interviews on the performance and scalability of these applications, the benchmark could be redesigned. This will require time and the support of experts, because of differences in geoprocessing functionality.

Extend performance benchmark with spatio-temporal data


Currently, Spatial Statistics has not used spatio-temporal data for statistical products, whereas team Process Development and Methodology has built up some experience with these data. It could be a good opportunity to cooperate on a project that involves performance analysis of geoprocessing of streaming spatial data in R Spatial, ArcGIS and another alternative, for example QGIS.

Find suitable strategies to conduct successful reengineering of processes based on performance benchmark results

It would be very interesting to continue where this research project left off: in the conclusions it has been established that some tools will form performance bottlenecks in ArcGIS Desktop 10.2 or higher. Are there proven strategies to adapt processes based on performance benchmark results?



Appendix 1: Description of spatial products Statistics Netherlands

This appendix briefly describes the most important products of team Spatial Statistics, which have been referred to in chapter 1. These have been regular products for many years already. New products are also being developed at the moment, but their status is not yet mature.

Bestand Bodemgebruik (BBG – Land Use File for the Netherlands)

The aerial imagery of the Netherlands is interpreted and encoded with land use codes, using datasets like TOP25 and Top10NL (BRT) as well as Locatus (a database containing consumer locations like shops, hotels, restaurants, etc.), the Woningregister (register of apartments) and the Bedrijvenregister (business register). The BBG is updated every 2 years; the northern and southern halves of the country are published alternately each year. This product is more a result of data entry than of geoprocessing: the used datasets serve as a reference to enter the correct land use code. The input datasets that are used are:

 BRT/Top10NL
 Aerial imagery (as reference layer)
 BBG previous version (since the introduction of the new production method)
 Locatus (reference layer)
 Woningregister (reference layer)

Financiële Verhoudingswet

Municipalities receive subsidies from the Ministry of the Interior and Kingdom Relations according to different criteria. Some of these criteria are spatial, such as the total area of built-up areas, rural areas, soil quality and the land-water ratio. These values are calculated by the Spatial Statistics team. Large sums of subsidy are assigned based on these calculations; therefore correct input data, calculation method and interpretation are essential. The main input for this product is the BRT, which is converted back to the Top10Vector format in coverage. The BRT is delivered and processed per map sheet.

The input datasets that are used are:

 BRT (Top10NL)
 GBR (Geografisch Basisregister)

Proximity Statistics

Twice a year, the proximity of facilities to residents via the road network is calculated per district/neighbourhood. The facilities vary from hospitals to schools, day-care institutions or libraries. This information is very helpful in providing location intelligence for municipalities and businesses. The input datasets that are used are:

 BAG (Key Register of Addresses and Buildings)
 NWB (Dutch Road Dataset)
 Datasets of facilities, e.g. Locatus, or facilities per sector (e.g. day-care facilities, libraries)

Wijk- en Buurtkaart – Municipality, District and Neighbourhood statistics

This product contains the geometry of municipality, district and neighbourhood borders combined with aggregated statistics. These data (3 datasets) are produced at the beginning of each year. The lead time for the statistics calculation differs per statistic type. The geometry is derived from the Key Register Kadaster, the CBS Land Use Dataset and municipalities; the statistics come from the “Kerncijfers Wijken en Buurten” (key figures of districts and neighbourhoods). The input datasets that are used are:

 “Bestand burgerlijke gemeentegrenzen” derived from the BRK (used for municipality boundaries)
 District and neighbourhood boundaries (municipalities are subdivided into districts, whereas districts are subdivided into neighbourhoods)


 Geografisch Basisregister (GBR)
 Kerncijfers Wijken en Buurten (table with demographic and socio-economic key figures of districts and neighbourhoods, for example percentage of male and female population, percentage of education level)

Bevolkingskernen – Urban agglomeration

The urban agglomeration product is based on the BBG, defining the contours of agglomeration areas according to certain preconditions, without taking administrative borders into account (www.cbs.nl). Input datasets:
 BBG (Land use map)
 NWB (as a reference layer)

Grid-based statistics

The grid-based statistics have been introduced due to a need for standard contours that do not change over the years and provide better privacy protection. The standard grid sizes are 100 by 100 m and 500 by 500 m within the RD New projection, and 1 km by 1 km in ETRS89 (according to INSPIRE requirements). Statistical data are disseminated per grid cell, enabling better comparison per area (CBS, 13). The following input datasets are used for the grid-based statistics:

 100*100 m grid feature classes
 500*500 m grid feature classes
 Kerncijfers Wijken en Buurten (table with demographic and socio-economic key figures of districts and neighbourhoods, for example percentage of male and female population, percentage of education level)

Figure 50: Grid-based statistics


Appendix 2: Description of a selection of spatial datasets

In general, all used datasets cover the Netherlands and are projected in RD New. Some datasets are very large and contain different tables (called “objects”). For the NWB, the datasets are subdivided into road, water and rail networks.

BRT
Scale: 1:10.000
Geometry type: varies per object
Size: 2,13 GB (FGDB), varies per year
Data format: FGDB or coverage
Objects:
 Wegdeel
 Spoorbaandeel
 Waterdeel
 Gebouw
 Terrein
 Inrichtingselement
 Relief
 Registratief gebied
 Geografisch gebied
 Functioneel gebied
Additional: 59 relationship classes

BAG
Scale: 1:5.000
Geometry type: varies per object
Size: 10,4 GB (FGDB), 8,87 GB (BAG_01052011metrelations.gdb)
Data format: FGDB
Objects: 7 objects (not all are used in one geoprocessing tool):
 Panden
 Nummeraanduidingen
 Verblijfsobjecten
 Staplaatsen
 Ligplaatsen
 Openbare Ruimte
 Woonplaats
Relationship classes: 8

NWB
Scale: 1:10.000
Geometry type: polylines
Size: not available
Data format: FGDB or coverage
Objects:
 NWB wegen (roads)
 NWB vaarwegen (shipping routes)
 NWB spoorwegen (rail roads)

BBG
Scale: 1:20.000
Geometry type: polygons
Size: 601 MB, varies per year
Data format: FGDB or coverage
Objects: BBG contains one table, using 12 attributes

Wijk- en buurtkaart
Scale: 1:
Geometry type: polygons
Size: 33,8 MB
Data format: FGDB or coverage
Objects: contains one table, using 11 attributes


Appendix 3: Nomenclature logfiles

An explanation of the nomenclature is provided in section 5.4, Setting up the test bed.

# nomenclature by sandra desabandu

# A = toolnamenr:
# ArcGIS 9x; 10x vs ArcInfo Workstation 9.3 and 10.0 command
# nr 0 = buffer
# nr 1 = joinfield vs joinitem
# nr 2 = dissolve
# nr 3 = spatial join
# nr 4 = intersect
# nr 5 = split
# nr 6 = clip
# nr 7 = near
# nr 8 = pointdistance
# nr 9 = OD matrix
# nr 10 = loop with cursor through records with Capitalize STREETNAME function
# nr 11 = frequency
# nr 12 = summarize statistics
# nr 13 = union
# nr 14 = Identity
# nr 15 = Reselect - Feature class to feature class
# nr 16 = Calculate
# nr 17 = Add Join workaround

# B = hardware:
# 0 = fat_client_normal
# 1 = big_data_inno
# 3 = laptop ArcGIS 10.2
# 4 = laptop ArcGIS Pro

# C = populationlength:
# nr 0: municipality small
# nr 1: municipality large
# nr 2: province
# nr 3: NL
# nr 4: 10000 records
# nr 5: 100000 records
# nr 6: 500000 records
# nr 6a: 600000 records
# nr 6b: 700000 records
# nr 6c: 800000 records
# nr 6d: 900000 records
# nr 7: 1000000 records
# nr 8: 5000000 records
# nr 9: 10000000 records
# nr 10: 50000000 records

# D = populationwidth:
# nr 0: only FID
# nr 1: FID 2 attributes
# nr 2: FID_all attributes
# nr 3: synthetic (OID, shape)

# E = index:
# nr 0: not indexed
# nr 1: indexed
# nr 2: no spatial index
# nr 3: spatial index default
# nr 4: spatial index adaption

# F = spatialsort:
# nr 0: no sorting
# nr 1: UL
# nr 2: UR
# nr 3: LR
# nr 4: LL
# nr 5: PEANO
# nr 6: PEANO plus field

# G = filetype:
# nr 0: coverage info
# nr 1: fgdb
# nr 2: pgdb
# nr 3: shp dbf

# H = compressingcompacting:
# nr 0: not compressed not compacted
# nr 1: compressed
# nr 2: compacted
# nr 3: compressed and compacted

# I = daytime:
# nr 0: Monday office hrs
# nr 1: Friday 7 pm

# J = ArcGISversion:
# nr 1: ArcGIS 10.1
# nr 2: ArcGIS 10.0
# nr 3: ArcGIS 10.2
# nr 4: ArcGIS 9.3.1
# nr 5: Workstation 9
# nr 6: Workstation 10
# nr 7: ArcGIS Pro

# K = Processing:
# nr 0: linear_default
# nr 1: 64 bit background
# nr 2: LAA
# nr 3: parallel_2threads
# nr 4: parallel_4threads

# L = locationscript:
# nr 0: network
# nr 1: local

# M = coldwarmrun:
# nr 0: cold
# nr 1: warm

# N = RD new projection (since 03-10-2014):
# nr 0: no projection
# nr 1: RD new projection

# O = data:
# nr 0: real data
# nr 1: synthetic data

# P = workspace:
# nr 0: different workspace
# nr 1: same workspace

# nomenclature
# toolnamenr_hardware_populationlength_populationwidth_index_spatialsort_filetype_compressingcompacting_daytime_ArcGISversion_Processing_locationscript_coldwarmrun
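As an illustration of how a test run name could be composed from this nomenclature, a minimal sketch in Python; the helper function and the example values are hypothetical, and the factor order follows the concatenation rule above.

FACTOR_ORDER = ["toolnamenr", "hardware", "populationlength", "populationwidth",
                "index", "spatialsort", "filetype", "compressingcompacting",
                "daytime", "ArcGISversion", "Processing", "locationscript",
                "coldwarmrun"]

def build_logname(factors):
    # Join the factor codes in the fixed order defined by the nomenclature.
    return "_".join(str(factors[name]) for name in FACTOR_ORDER)

# Example: NEAR (7) on the fat client (0), NL dataset (3), FID + 2 attributes (1),
# indexed (1), no sorting (0), fgdb (1), no compression (0), office hours (0),
# ArcGIS 10.1 (1), linear default (0), local script (1), cold run (0).
example = dict(zip(FACTOR_ORDER, [7, 0, 3, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0]))
print(build_logname(example))   # -> 7_0_3_1_1_0_1_0_0_1_0_1_0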


Appendix 4: Questionnaire geoprocessing and migration in Geo ICT Infrastructure

Contact data

Name of organization:

Name contact person:

Address:

Email:

Telephone number:

GIS users
1. Can you indicate how many of the GIS users in your organisation are “light users” (map viewing, simple analysis, editing)?
2. Can you indicate how many of the GIS users in your organisation are “heavy users” (staff members tasked with heavy analysis of datasets and writing models)?
3. Which GIS products are used:
a. ESRI: Arc/Info Workstation/AML, ArcGIS Desktop/ArcPy, ArcView/Avenue
b. Other: Open Source, Oracle, ETL

Geoprocessing & performance 1. Can you provide 1 or 2 examples of important GIS analysis products and the analysis steps that are taken? 2. Do you execute them via scripting, modelbuilder or on the fly? 3. What are the most important datasets you use for geoprocessing? 4. Which methods do you apply to improve performance or keep an acceptable level of performance? Examples: partitioning of datasets, parallel processing, etc. 5. Is there enough time available to experiment with performance improvement of geoprocessing?

ICT Infrastructure
1. How are the special needs of GIS users (need for higher calculation and storage capacity of the hardware, availability of up-to-date tooling) met by the ICT department of your organisation? For example, do you have staff members tasked with cooperation between the ICT department and the GIS users to ensure these needs are met?
2. How do you conduct the geoprocessing?
a. On the local drive
b. Via the network
c. Virtual (if yes, which virtualization software, e.g. Citrix or VMware)
d. Other (e.g. cloud services)

Systems migration process (applicable to organizations that have migrated from an old GIS like Arc/Info Workstation to a more recent GIS)
1. From which system did you migrate?
2. What is the current system?
3. When did you migrate to the current system (e.g. ArcGIS Desktop)?


4. Did you redesign the geoprocessing scripts yourself or did you outsource this?
5. Did you observe a difference in performance and stability of the geoprocessing tools between the old and the new system? If yes, please answer the following sub-questions:
a. Can you provide 3 examples?
b. Did you anticipate these differences beforehand, e.g. via performance tests?
c. If there was a negative change, how did your organisation deal with it? For example, did you have to stop the use of certain analysis models or did you have to redevelop them?
6. Do you still have GIS users that are working with the “old” system? If yes, please answer the following sub-questions:
a. How many users?
b. Why do they still use it?
c. When and how will they migrate to the new system in the future?
7. What are the lessons learned from the migration project?
8. Are you planning to test new products (such as ArcGIS Pro)?


Appendix 5: Results interviews and questionnaires

GIS users

Statistics Netherlands
Nr of light users (app.): 25; nr of heavy users (app.): 8
GIS software: a. ESRI: ArcInfo Workstation/AML, ArcGIS Desktop/ArcPy, ArcView/Avenue, ArcGIS Server (for web services); b. Other: R Spatial (not used by Spatial Statistics, but by the methodology department), Quantum GIS (used by Spatial Statistics to view web services)

Statistics Italy
Nr of light users (app.): 50; nr of heavy users (app.): 20
GIS software: a. ESRI: ArcGIS Desktop/ArcPy; b. Other: very limited use of open source

Statistics Portugal
Nr of light users (app.): 25; nr of heavy users (app.): 5
GIS software: a. ESRI: ArcGIS 10.2.2, ArcSDE 10.0 (Oracle 11g database), ArcGIS Server 10.0, ArcPad 7.1, ArcView 3.3, ArcInfo Workstation only rarely used; b. Other: Oracle 11g database running ArcSDE 10.0, Quantum GIS, P-Mapper, several Python modules with geographical capabilities

Netherlands Environmental Assessment Agency
Nr of light users (app.): 70; nr of heavy users (app.): 30
GIS software: a. ESRI: ArcGIS Desktop 10.1, ArcPy, ArcGIS Server (for web services); b. Other: FME (trial), PostgreSQL/PostGIS to import the BRK (Key Register Kadaster), experimental use of Manifold and QGIS

U.S. Geological Survey
Nr of light users: 20% (of the GIS users)10; nr of heavy users: 80% (of the GIS users)
GIS software: a. ESRI: ArcInfo Workstation/AML, ArcGIS Desktop/ArcPy, ArcView/Avenue; b. Other: -

Table 25: GIS users of similar organizations

10 The USGS has app. 10.000 staff members (scientists, technicians, support staff), figures of the total number of GIS users are not available

Geoprocessing and performance optimization

Statistics Netherlands
Analysis examples11: BBG – Land use map for the Netherlands (obtained from aerial imagery); analysis of spatial features per municipality; Municipality-, District- and Neighbourhood-based statistics; grid-based statistics; urban agglomeration areas
Execution method: scripting, ModelBuilder
Datasets: national reference vector datasets (see chapter 2), imagery
Optimization methods: partitioning, local processing
Time to experiment for optimization: no

Statistics Italy
Analysis examples: analysis of commuting flows; automated procedures for updating tabular data
Execution method: scripting, ModelBuilder
Datasets: network dataset
Optimization methods: none
Time to experiment: not enough

Statistics Portugal
Analysis examples: spatial analysis of building dataset; scripts to validate and manage geographical data; network analysis; proximity statistics
Execution method: applications, scripting, ModelBuilder
Datasets: geographical database with buildings (3.5 million records), street database, European GRID, other reference datasets
Optimization methods: partitioning, database versioning, database schema for reference and dynamic datasets
Time to experiment: only during the design process

Netherlands Environmental Assessment Agency
Analysis examples: environmental models containing a large number of tools; network analysis
Execution method: scripting, ModelBuilder
Datasets: national land use data such as BBG, HGN, LGN; imagery, AHN; Navteq data
Optimization methods: multiprocessing for CPU-bound tools (in ArcGIS), grid computing (no performance difference has been detected between those methods)
Time to experiment: yes

U.S. Geological Survey
Analysis examples: building elevation/hydro integrated data sets for hydro analysis; raster analysis
Execution method: on the fly
Datasets: land use and land cover, elevation data
Optimization methods: mostly tiling of data
Time to experiment: yes

Table 26: Geoprocessing and performance optimization at similar organizations

11 For SN, see also section 2.2

ICT Infrastructure

Statistics Netherlands
Communication about ICT needs of the GIS users: 1 (part-time) ICT coordinator for the GIS staff; 2 application managers of the ICT department with knowledge of GIS
Infrastructure used for geoprocessing: fat client; network (script and user profile)

Statistics Italy
Communication: GIS data is stored locally and often administered by a GIS manager. Since data is administered, edited and used only by a few people who are closely working together, problems such as concurrent data use conflicts can be resolved easily. Data distribution is also simple because the needs of the user community are known and the GIS group does not need to deal with a wide variety of data requirements.
Infrastructure: data are stored on a local or shared drive as file-based datasets such as shapefiles or in a personal geodatabase

Statistics Portugal
Communication: communication of special needs to the ICT department, e.g. regarding disk space and processing capabilities; senior staff of the Geo-Information Unit are expected to cooperate closely with the ICT department
Infrastructure: recurrent processes are executed with data on the network or on the local drive, and the result is copied to the network or the geographical database

Netherlands Environmental Assessment Agency
Communication: heavy GIS users are represented in a GIS expert group; there is a special department called “Information, data and methods” (IDM) which supports other departments concerning information management, data and methods; the IDM Geo point has been established for GIS users
Infrastructure: storage of data in fgdb format on the network, dissemination of the data via layerfiles on a data portal for the users; the PBL works with virtual Citrix desktops, however, users can choose whether they want to use the local desktop or the virtual desktop; the data centre is situated on the terrain of the main location, containing the virtual servers but also the “physical” ArcGIS servers, which can be used as a computer grid

U.S. Geological Survey
Communication: generally these needs are met, as they are critical to the research of the USGS
Infrastructure: all options are used (local processing, network, virtual desktop and cloud processing)

Table 27: ICT infrastructure of similar organizations


Migration experiences

Statistics Italy
- Former system: ArcInfo Workstation
- Current system: ArcGIS Desktop 10.1
- Period of migration: 2002
- Outsourcing: ESRI performed a number of initial tasks, such as the (file geo)database design and a pilot project
- Remaining workstation users: a few users, tasked with the editing
- Motivation: ArcInfo Workstation was difficult to maintain; no possibility to document metadata in ArcInfo Workstation
- Migration of the last users: no information
- Difference between Workstation and Desktop: no information

Statistics Portugal
- Former system: ArcInfo Workstation and ArcView
- Current system: ArcGIS Desktop, ArcSDE and ArcGIS Server
- Period of migration: 2003-2007
- Outsourcing: only help from ESRI to create the first version of the geographical database in 2003 and to develop the applications for the 2011 geographical data
- Remaining workstation users: a few users of ArcView 3.3
- Motivation: the remaining users produce thematic maps and feel more comfortable within ArcView
- Migration of the last users: the possibility of making thematic maps will continue; only editing within the geodatabase is not recommended
- Difference between Workstation and Desktop: the new geoprocessing tools are more powerful and flexible than the old AML scripts; it is easier to link the census to alphanumeric data; no complaints about the performance of ArcSDE12

Netherlands Environmental Assessment Agency
- Former system: ArcInfo Workstation
- Current system: ArcGIS Desktop 10.1
- Period of migration: no information
- Outsourcing: yes; the AML scripts of the models were redesigned into Python by an external company (see the sketch after this table)
- Remaining workstation users: none
- Motivation: not applicable
- Migration of the last users: not applicable
- Difference between Workstation and Desktop: all models have been tested for stability; during that process a number of models were discarded because they did not run stable in Python/ArcPy

U.S. Geological Survey
- Former system: ArcInfo Workstation
- Current system: ArcGIS Desktop 10.1 sp1
- Period of migration: 2001
- Outsourcing: no
- Remaining workstation users: a handful ("still works")
- Motivation: the application still works satisfactorily
- Migration of the last users: gradual process, using both applications
- Difference between Workstation and Desktop: some things were faster, some were not as good; 10.1 provided the same (and improved) functionality with arcpy.mapping, ModelBuilder and Python capability

Table 28: Migration experiences from similar organizations

12 Datasets of about 3.5 million records can be used without any problems (although the execution time is not known).
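To make the Workstation-to-Desktop migrations in Table 28 more concrete: an AML macro that chains commands maps fairly directly onto ArcPy geoprocessing calls, which is the kind of one-to-one translation the redesigned PBL models rely on. The workspace and feature class names below are hypothetical, and the AML lines in the comment are only an indicative example.

```python
# Hypothetical ArcPy equivalent of a classic AML sequence such as:
#   BUFFER roads roads_buf # # 100
#   CLIP landuse roads_buf landuse_clip
import arcpy

arcpy.env.workspace = r"C:\data\migration.gdb"  # hypothetical workspace
arcpy.Buffer_analysis("roads", "roads_buf", "100 Meters")
arcpy.Clip_analysis("landuse", "roads_buf", "landuse_near_roads")
```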

Appendix 6: Error Report ESRI

From: Esri Nederland Support [mailto:[email protected]]
Sent: Thursday, 9 April 2015 10:53
To: Zuurmond, Drs. M.
Subject: 1057606 - Different result with the same tool, machine, data, etc.

Dear Mr Zuurmond,

Thank you for your support request to Esri Nederland. Your question, described as "Different result with the same tool, machine, data, etc.", has been registered in our system under number 1057606. One of our support engineers will contact you as soon as possible with an answer or with additional questions.

Click here for the details of your support request.

If you do not yet have login credentials for mijn.Esri.nl, you can request them via http://mijn.Esri.nl/standaard/wachtwoord-aanvragen. Via the Mijn Esri portal you can, besides submitting support questions, also register for events and adjust your mailing preferences. You can reach this page directly via https://mijn.Esri.nl/aanpassen-persoonsgegevens-prs.

Kind regards,

Esri Nederland Support

T: +31 (0)10 217 07 50 E: [email protected]

Original question:

Dear support staff,

Over the past year, a graduation research project into the performance of GIS software was carried out at our organization by intern Sandra Desabandu. One unexpected result was that two tools, each run 5 times in a single loop, generated different output.
- A Dissolve tool on a selection of the BBG generated an output of 150,074 records in loops 1 to 3 and 5, and an output of 150,087 records in the 4th loop.
- An Intersect between forest polygons (bbg = 60) and the NWB generated 85,488 records (intersected line segments) in the 1st, 2nd and 5th run, and 85,527 records in the 3rd and 4th run.

ArcGIS Desktop 10.1 sp1 and Windows 7 sp1 were used.

Although the differences are minimal, it is nevertheless strange that the same tool in a Python script with a loop creates different output. The output was written away with a sequence number on every run. Is it known at ESRI that this can occur with these two tools? If desired, I can send a DVD with the data and the script.

Kind regards,
Marijn Zuurmond
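The inconsistency reported above can be checked with a short script that runs the same tool on the same input several times and compares the record counts. The sketch below is a minimal reconstruction of that kind of loop, not the original script; the dataset paths are hypothetical placeholders for the BBG forest selection and the NWB.

```python
# Minimal sketch of a repeated-run consistency check (all paths hypothetical).
import arcpy

arcpy.env.overwriteOutput = True
bos = r"C:\data\bbg_bos.shp"   # hypothetical: forest polygons (bbg = 60)
nwb = r"C:\data\nwb.shp"       # hypothetical: national road network (NWB)

counts = []
for run in range(1, 6):
    # Output is written away with a sequence number, as in the reported test.
    out_fc = r"C:\data\out\intersect_%d.shp" % run
    arcpy.Intersect_analysis([bos, nwb], out_fc)
    counts.append(int(arcpy.GetCount_management(out_fc).getOutput(0)))

# With identical input, tool and settings, all five counts should be equal.
print(counts)
```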

