Workflow-Driven Geoinformatics Applications and Training in the Era

İlkay ALTINTAŞ, Ph.D. Chief Officer, San Diego Center Division Director, Cyberinfrastructure Research, Education and Development Founder and Director, Workflows for Data Science Center of Excellence SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego Providing Cyberinfrastructure for Research and Education • Established as a national supercomputer resource center in 1985 by NSF • A world leader in HPC, data-intensive computing, and scientific data management • Current strategic focus on “Big Data”, “versatile computing”, and “life sciences applications”

Recent Innovative Architectures • Gordon: First Flash-based Supercomputer for Data-intensive Apps • Comet: Serving the Long Tail of Science Data Science Today is Both a Big Data and a Big Compute Discipline Requires: • Data management • Data-driven methods • Scalable tools for dynamic coordination COMPUTING AT and resource SCALE BIG DATA optimization • Skilled interdisciplinary workforce Enables dynamic data-driven applications

Computer-Aided Drug Discovery Smart Cities Disaster Resilience and Response New era of data science!

Smart Manufacturing Personalized Precision Medicine Smart Grid and Energy Management Needs and Trends for the New Era Data Science Ultimate Goal Action Insight Big Data Big Data Science How does successful data science happen?

Insight Data Product Exploratory Analysis Insight “Big” Data and Modeling Question Insights amplify the value of data…

…, but there are many ways to get to insights. Approach: Focus on Process and Team Work

ACQUIRE PREPARE ANALYZE REPORT ACT … Create an Ecosystem that Enables Needs and Best Practices

ACQUIRE PREPARE ANALYZE REPORT ACT

• data-driven • accountable • dynamic • reproducible • process-driven • interactive • collaborative • heterogeneous What would it such an ? ecosystem look like? Creating a Collaborative Data Science Ecosystem on top of Advanced Infrastructure What are some challenges specific to atmospheric sciences? Geospatial Big Data Drone imagery • Flood of new data sources and types • Needs new data management, storage and analysis Real-time sensors methods • Too big for a single server, fast growing data volume • Requires special database structures that can handle data variety Satellite imagery • Too continuous for analysis at a later time, with increasing streaming rate, i.e., velocity • Varying degrees of uncertainty in measurements, and Weather forecast other veracity issues • Provides opportunities for scientific understanding at different scales more than ever, i.e., potential high value Sea Surface Temperature Measurements The ‘scalability’ bottleneck

• Resources needed for geospatial big data (e.g., satellite imagery) analysis exceed current capabilities, especially in an on-demand fashion • Cloud computing is an attractive on-demand decentralized model • Need new scheduling capabilities • on-demand access to a shared configurable resources • programmable networks, servers, storage, applications, and services • Need ability to easily combine users environment and community tools together in a scalable way • Various tools with different computing scalability needs • Cost!!! The ‘sensor data’ bottleneck

• Data streaming in at various rates • “Big Data” by definition in its volume, variety, velocity and viscosity • Need to improve veracity and add value by providing - and standards-aware on-the-fly archival capabilities • QA/QC and automate (real-time) analysis of streaming data before it is even archived. • Often low signal-to-noise ratio requiring new methods • Need for integration of new streaming data technologies The “workforce” bottleneck

• Geospatial data processing requires a lot of expertise • GIS, domain expertise, data engineering, scalable computing, , … • No open geospatially enabled big data science education platform • Teach not just technical knowledge, but collaborative work culture and ethics Using workflows to get there… Workflows for Data Science WorDS.sdsc.edu Center of Excellence at SDSC

Real-Time Hazards Management Data-Parallel wifire.ucsd.edu bioKepler.org

Focus on the question, • Access and query data not the technology! • Support exploratory design • Scale computational analysis • Increase reuse • Save time, energy and money • Formalize and standardize Goal: Methodology and tool • Train development to build automated and operational workflow-driven solution architectures on big data and HPC platforms. Scalable Automated Molecular Dynamics and Drug Discovery nbcr.ucsd.edu How can I get smart people to collaborate and communicate Focus on the question, to analyze data and not the computing to generate technology! insight and solve a question? Programmability Ease of use, iteration, interaction, re-use, re-purpose Reproducibility Ability to validate, re-run, re-play Scalability Reporting From local experiments to large-scale runs

Workflow Execution Execution Workflow Review Workflow Deploy Scheduling Design and and Publish Execution Planning Process for Practice Workflow Provenance of Data Science Monitoring Analysis

BUILD and SHARE SCALE LEARN EXPLORE and and ITERATE REPORT P’s in PPoDS

People Programmability

Process ? Purpose

Platforms Example: Using geospatial big data for wildfire predictions WIFIRE: A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires http://wifire.ucsd.edu Monitoring Visualization Big Data Fire Modeling Closing the Loop using Big Data -- Wildfire Behavior Modeling and Data Assimilation -- • Computational costs for existing models too high for real-time analysis • a priori -> a posteriori • Parameter estimation to make adjustments to the (input) parameters • State estimation to adjust the simulated fire front location with an a posteriori update/measurement of the Conceptual Data Assimilation Workflow with actual fire front location Prediction and Update Steps using Sensor Data Fire Modeling Workflows in WIFIRE

firemap.sdsc.edu

Fire perimeter

Real-time sensors

Landscape data

Monitoring & Weather forecast fire mapping Data-Driven Fire Progression August 2016 – Blue Cut Fire Prediction Over Three Hours

Tahoe and Nevada Bureau of Land Management Cameras: 20 cameras added with field-of-view

Collaboration with LA and SD Fire Departments http://firemap.sdsc.edu Some Machine Learning Case Studies

• Smoke and fire perimeter detection based on imagery • Prediction of Santa Ana and fire conditions specific to location • Prediction of fuel build up based on fire and weather history • NLP for understanding local conditions based on radio communications • Deep learning on multi-spectra imagery for high resolution fuel maps • Classification project to generate more accurate fuel maps (using Planet Labs satellite data) Classification project to generate more accurate fuel maps • Accurate and up-to-date fuel maps are critical for modeling wildfire rate of speed and potential burn areas. • Challenge: • USGS Landfire provides the best available fuel maps every two years. • The WIFIRE system is limited by these potentially 2-year old inputs. Fuel maps created at a higher temporal frequency is desired. • Approach:

• Using high-resolution satellite imagery and deep Cluster 1: Short Grass learning methods, produce surface fuel maps of San Diego County and other regions in Southern California. • Use LandFire fuel maps as the target variable, the objective is create a classification model that will provide fuel maps at greater frequency with a measure of uncertainty. Fig. 5: Cluster histogram showing the image tiles. Cluster numbers increase from left to right. Reused in Built Infrastructure and Demographic Analysis

Data Preparation Tiles Data Preparation Pan-Sharpen Reproject Create RGB Downsample Crop Images Satellite Imagery Pan-Sharpen Reproject Create RGB Downsample Crop Images Tiles Satellite Imagery

Machine Learning Machine Learning Histogram and Map Histogram and Map Feature Ordered CNN Cluster PCA Sort Feature Ordered Extraction CNNClusters Cluster PCA Sort Extraction Fig. 6: High resolution tiled display showingClusters cluster histogram. Fig. 5: Cluster histogram showing the image tiles. Cluster numbers increase from left to right. Fig. 2: The Integrated Processing Pipeline for Data Preparation and Unsupervised Learning Fig. 2: The Integrated Processing Pipeline for Data Preparation and Unsupervised Learning in the home leads to higher per capita resource consumption. V. V ISUALIZATION AND MAPPING OF CLUSTERS This approach is complimentary to ours in that the agent-based modeling predicts the demographic dynamics. The results of the clustering are visualized in a web portal in the home leads to higher perthat capita can resource be viewed consumption. using any browser. The first part of the III. DATASET PREPARATION This approach is complimentaryportal to ours displays in that the the cluster agent-based histogram where each element is a The analysis for this project was performed on imagery modeling predicts the demographic200x200 dynamics. pixel tile as shown in Figure 5. Users can zoom in supplied by Digital Globe. We purchased a WorldView3 8- and pan around the histogram to see details of the individual band satellite image for a portion of Mumbai, India. The tiles. imagery was then preprocessed for the CNN using GDAL [14]. III. DATASET PREPARATION The image was pan-sharpened (the multispectral image reso- lution was sharpened with the panchromatic image that was Visually exploring this histogram, one can see the urban taken with it), reprojected the map projection to WGS84, and The analysis for this projectspaces was fall performed mostly to on the imagery left, and open space on the right. the red, green, and blue bands extracted to create a color-only supplied by Digital Globe. WeInformal purchased settlements a WorldView3 are interspersed 8- in the middle and to the RGB image. The image then was scaled from 16-bit to 8-bit right. for the analysis and easy viewing of the results. band satellite image for a portion of Mumbai, India. The We then wrote a Python script using GDAL commands Fig. 4: Samplesimagery of image was tiles then in a cluster preprocessed of Northeast trending for theThe CNN same using histogram GDAL is [14].also viewable on tiled display, Fig. 7: Heat map showing concentrations of slum cluster. to split the imagery to 200x200-pixel tiles, which enables roads, andThe heat map image of all was image pan-sharpened tile locationsFig. in that 6: High cluster. (theas resolution shown multispectral tiled in displayFig 6, showingimage for deeper cluster reso- histogram. high resolution collabora- our algorithms to focus on smaller sub-regions than the large lution was sharpened with the panchromatictive explorations image of the thatresults. was Detailed visual analytics are land areas covered by high resolution satellite scenes. In a screenshot of their associated heat map displaying thepossible tile using the tiled display wall by seeing all 300,000 short, we illustrate in Figure 3 how we crop each tile into locations.taken withV. V it),ISUALIZATION reprojected AND MAPPING the map OF C projectionLUSTERS to WGS84, and smaller thumbnails for downstream analysis. For our section the red, green, and blue bands extractedtiles at once. to create Although a color-only the tiled display is very useful for In the web interface, clicking on a tile opens a new page in of Mumbai, one 3.4GB image created approximately 30,000 The results of the clustering are visualized in a web portal Our methods focus on the city of Mumbai but we arescientific also discussions and in depth analysis of images, web the portal that is divided into two sections: on the left is a small tiles. RGBthat image. can be viewed The using image any browser. then was Theinterface first scaled part makes of from the it 16-bitpractical to for 8-bit working with the images in any version of the histogram where the x-axis in the cluster and interested in expandingportal displays and adapting the cluster our methodology histogram where to other each element is a cities. Wefor therefore200x200 the analysis downloaded pixel tile and asscenes shown easy from in Figure otherviewing cities 5.environment. Users to of can the zoom results. Thisin modality makes our histogram visualization the y-axis is the tile number, and on the right is a responsive test the transferabilityand pan of around our methods the histogram and answer to see fundamental detailsapproach of the individual scalable to many scenarios. map of Mumbai. The map is constructed using Leaflet [27], questions aboutWetiles. the urban then similarity wrote and a differences Python of Mumbai script using GDAL commands Fig. 4: Samples of image tiles in a cluster of Northeast trending to other cities worldwide. We selected six other cities, namely, roads, and heat map of all image tile locations in that cluster. Delhi (India),to split RioVisually de the Janeiro exploring imagery (Brazil), this Johannesburg histogram, to 200x200-pixel one (South can see the tiles,urban which enables Africa), Londonourspaces algorithms (United fall Kingdom), mostly to to focus the and left, New on and York smaller open and space San sub-regions on the right. than the large Diego (USA).Informal The imagery settlements for these are cities interspersed was downloaded in the middle and to the from Digitalland Globe’sright. areas Global covered Basemap [15], by an imagery high serviceresolution satellite scenes. In a screenshot of their associated heat map displaying the tile that enablesshort, a subscriber we to illustrate search and download in Figure available 3 im-how we crop each tile into locations. agery from the DigitalThe Globe same archive histogram from is all also of their viewable available on tiled display, Fig. 7: Heat map showing concentrations of slum cluster. as shown in Fig 6, for deeper high resolution collabora- satellite sources.smaller The thumbnails Global Basemap for data downstream is searched and analysis. For our section tive explorations of the results. Detailed visual analytics are downloadedof using Mumbai, a web interface one 3.4GB that enables image searching created via approximately 30,000 possible using the tiled display wall by seeing all 300,000 keywords or coordinate searches similar to a Google Maps Our methods focus on the city of Mumbai but we are also tiles.tiles at once. Although the tiled display is very useful for In the web interface, clicking on a tile opens a new page in interface. When zoomed in, the available imagery for a region scientific discussions and in depth analysis of images, web the portal that is divided intointerested two sections: onin the expanding left is a small and adapting our methodology to other Fig. 3: Cropping satellite tiles to thumbnails. shows in a right-sided pane to scroll through the dates for when interface makes it practical for working with the images in any version of the histogram wherecities. the x-axis We thereforein the cluster and downloaded scenes from other cities to imagery is available.environment. Each image This can modality be manually makes our reviewed histogram for visualization the y-axis is the tile number, and on the right is a responsive clarity, cloud coverage,approach andscalable seasonality. to many Images scenarios. are available map of Mumbai. The maptest is constructed the transferability using Leaflet [27], of our methods and answer fundamental To associate the spatial reference for the images, a second from all satellite image products that Digital Globe provides. Python script was written to extract corner coordinates from Images can then be selected for download by being added questions about the urban similarity and differences of Mumbai the metadata of each tile and used to create a database of tile to the user’s library. When the imagery is extracted from to other cities worldwide. We selected six other cities, namely, names and their coordinates. The satellite image tiles were fed the Digital Globe archive to the user’s library, the user can into the feature extraction framework (described below), and download the image to their hard drive from the web interface. Delhi (India), Rio de Janeiro (Brazil), Johannesburg (South then each of the clusters were mapped. Figure 4 shows samples These images were downloaded and then tiled in the steps Africa), London (United Kingdom), and New York and San of a tile cluster of roads oriented Northeast and Southwest, and above. Diego (USA). The imagery for these cities was downloaded from Digital Globe’s Global Basemap [15], an imagery service that enables a subscriber to search and download available im- agery from the Digital Globe archive from all of their available satellite sources. The Global Basemap data is searched and downloaded using a web interface that enables searching via keywords or coordinate searches similar to a Google Maps interface. When zoomed in, the available imagery for a region Fig. 3: Cropping satellite tiles to thumbnails. shows in a right-sided pane to scroll through the dates for when imagery is available. Each image can be manually reviewed for clarity, cloud coverage, and seasonality. Images are available To associate the spatial reference for the images, a second from all satellite image products that Digital Globe provides. Python script was written to extract corner coordinates from Images can then be selected for download by being added the metadata of each tile and used to create a database of tile to the user’s library. When the imagery is extracted from names and their coordinates. The satellite image tiles were fed the Digital Globe archive to the user’s library, the user can into the feature extraction framework (described below), and download the image to their hard drive from the web interface. then each of the clusters were mapped. Figure 4 shows samples These images were downloaded and then tiled in the steps of a tile cluster of roads oriented Northeast and Southwest, and above. Summary

• Geospatial big data has all the typical big data challenges • Lessons learned from other disciplines to deal with these challenges should be applied • Workflows can be used both for managing scalable coordination and training students and workforce • Dynamic data-driven integration of machine learning, data assimilation and modeling is of potential use to many geo applications WIFIRE Team: It takes a village!

• PhD level researchers • Professional software developers • 32 undergraduate students • UC San Diego • UC Merced • Monash University Calit2/QI- • University of Queensland SDSC - Cyberinfrastructure, GIS, • 1 high school student Cyberinfrastructure, Advanced Visualization, Machine Learning, • 4 MSc and 5 MAS students Workflows, • 2 PhD students (UMD) Data engineering, Urban Sustainability, Machine Learning, HPWREN • 1 postdoctoral researcher Information Visualization, HPWREN UMD - Fire modeling

UCSD MAE - Data assimilation SIO - HPWREN Questions? Part Part of the presented presented the of work isNSF, by work funded DiegoSan UC variousNIH, and industrypartners. DOE,

WorDS Director: Ilkay Altintas, Ph.D. Email: [email protected]