This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 1

Practical Adoption of Computing in Power Systems – Drivers, Challenges, Guidance, and Real-world Use Cases IEEE Task Force on for Power Grid

Song Zhang (TF Chair), Senior Member, IEEE, Amritanshu Pandey (TF Vice-Chair), Member, IEEE, Xiaochuan Luo (TF Vice-Chair), Senior Member, IEEE, Maggy Powell, Member, IEEE, Ranjan Banerji, Member, IEEE, Lei Fan, Senior Member, IEEE, Abhineet Parchure, Member, IEEE, Edgardo Luzcando, Member, IEEE

 computing services that provides essential virtualized Abstract—Motivated by FERC’s recent direction and ever- resources such as computing, storage and networking over growing interest in cloud adoption by power utilities, a Task Force the was established to assist power system practitioners with secure, Platform-as-a-Service – a category of cloud computing reliable and cost-effective adoption of cloud technology to meet services that allows customers to provision, instantiate, run, various business needs. This paper summarizes the business drivers, challenges, guidance, and best practices for cloud and use a modular bundle comprising of a computing adoption in power systems from the Task Force’s perspective, platform and one or more applications without the need for after extensive review and deliberation by its members that users to perform essential management (e.g., patching) include grid operators, utility companies, software vendors and Software-as-a-Service – a category of cloud computing cloud providers. The paper begins by enumerating various services where a vendor hosts applications and deliver them business drivers for cloud adoption in the power industry. It to end users over the Internet on a subscription basis follows with the discussion of challenges and risks of migrating power grid utility workloads to cloud. Next for each corresponding Function-as-a-Service – a category of cloud computing challenge or risk, the paper provides appropriate guidance. services that allow users to develop, deploy and run single- Importantly, the guidance is directed toward power industry purpose applications by modules without having to manage professionals who are considering cloud solutions and are yet servers hesitant about the practical execution. Finally, to tie all the sections Container-as-a-Service – a category of cloud computing together, the paper documents various real-world use cases of services that allows software developers and IT departments cloud technology in the power system domain, which both the power industry practitioners and software vendors can look to upload, organize, run, scale, and manage containers by forward to design and select their own future cloud solutions. We using container-based virtualization hope that the information in this paper will serve as useful Cloud Service Provider – a company that offers guidance for the development of NERC guidelines and standards components of cloud computing such as the aforementioned relevant to cloud adoption in the industry. services (e.g., cloud-based services and infrastructure) Elastic Computing – a cloud computing characteristic Index Terms—Cloud computing, cloud adoption, public cloud, that denotes secure and resizable compute capacity to meet cyber security control, compliance, service reliability, fault- tolerant architecture, resilient infrastructure, operations and changing demands without fixed preplanning for capacity planning. and engineering for peak usage – a cloud computing execution KEY TERMINOLOGIES model in which the cloud provider provides an execution IT/OT – Information Technology and Operational environment that does not require users to manage servers Technology Vertical Scaling (Scaling Up) – scaling by adding more Infrastructure-as-a-Service – a category of cloud resources to a single node such as additional CPU, RAM, and DISK to cope with an increasing workload

S. Zhang and X. Luo are with ISO New England, Holyoke, MA 01040 USA (e-mail: [email protected], [email protected]). A. Pandey was with Pearl Street Technologies, Pittsburgh, PA, USA. He is now with the Department of ECE, Carnegie Mellon University, Pittsburgh, PA 80523 USA (e-mail: [email protected]). M. Powell, R. Banerji, and A. Parchure are with , Herndon, VA 20170, USA (e-mail: [email protected], [email protected], [email protected]) L. Fan is with University of Houston, Houston, TX 77004, USA (e-mail: [email protected]) E. Luzcando was with Midcontinent ISO from 2003 to 2012 and New York ISO from 2012 to 2018. He is now with Performance Improvement Partners, Shelton, CT 06484, USA (e-mail: [email protected])

This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 2

Horizontal Scaling (Scaling Out) – scaling and increasing these grid applications because it offers fault-tolerant system resources by adding more nodes (e.g., virtual machines) to design capabilities and benefits without an equal (linear) an existing pool of nodes rather than simply adding increase in cost. Learning from numerous successful resources to a single node applications in the industries mentioned above where no fewer Regions and Availability Zones – Regions are geographic rigid requirements on cyber security and compliance are locations in which cloud service providers' data centers are imposed, the Task Force concludes that cloud technology can geographically located. Availability Zones are isolated benefit the entire power industry. locations within Regions from which cloud Despite the firm resistance to cloud adoption by the power services originate and operate; industry due to often misunderstood fundamentals, some Shared Responsibility Model – a cloud security innovative organizations have begun to use the cloud for non- framework that dictates the security obligations of a cloud critical, low-impact workloads such as planning studies and computing provider and its users to ensure accountability load forecasting. Given this background, the Task Force Spot instance – a node (e.g., virtual machine) that uses a initiated this work to systematically organize the most recent spot market pricing model for a CSP’s unused capacity. cloud applications in the electric energy sector and provide Spot instances come at a steep discount but can be shut down expert guidance for cloud adoption to power system users and when the CSP no longer has unused capacity. software vendors by combining: i) best practices endorsed by On-Demand Instance– a node (e.g., virtual machine) that the Cloud Service Providers (CSP) and ii) experiences learned you pay for by the hour or second with no long-term from the pioneering use cases of cloud technology in power commitments. industry. The goal of this work is to address the common Reserved Instance– a node (e.g., virtual machine) that concerns over the cloud adoption, provide guidance for reliable you pay a discounted rate for by reserving compute capacity and secure use of cloud resources for power entities and and committing to long term usage (for example 1 to 3 elaborate how cloud technology can help in a variety of power years). system businesses. Additionally, this document also aims to untangle common misconceptions related to cloud technology in the power industry and help software vendors design I. INTRODUCTION products that are better suited to the cloud. loud computing is a mature technology that has become The authors of the paper recommend that the readers C ubiquitous in modern business environment. Concurrently, consider the following before reading the document. The paper it is globally seen as critical infrastructure [1] like other vital is divided into four independent broad sections. Section II resources such as power, gas, and fresh water supply [2]. While describes the key business drivers for cloud adoption in the field adoption of cloud technology in the power industry faces of power systems. If the reader's interest lies solely on drivers resistance on several fronts, e.g., cyber security, compliance, for cloud adoption, they should read Section II. Section III of cost, consistency, latency, software pricing and licensing, there the paper documents the known challenges from the power grid is growing interest for cloud adoption amongst many grid utilities and software vendors perspective when adopting or entities driven by their business needs. In part this is driven by developing cloud technology. Section IV is closely tied to rapid grid modernization and decarbonization, which requires Section III and provides guidance to the reader for every ever-growing demand for data analytics and resources such as challenge documented in Section III. In case the reader is computing, network, and storage. Traditional on-premises interested in reviewing a specific challenge, he or she can jump facilities are facing constraints and the utilities most affected directly to the relevant sub-section in Section III and follow that are eagerly searching for scalable and cost-effective solutions with corresponding guidance for that challenge in Section IV. to meet their fast-rising needs. Cloud technology is an obvious Finally, Section V brings together real-world use cases of cloud choice. Independently, power system practitioners are technology in power grids. It combines discussion from prior continuously seeking advanced algorithms and new solution sections on drivers, challenges, and guidance to frameworks that can benefit from elastic compute resources comprehensively document some of the real-world examples of inherent in cloud computing. This, along with cost effective cloud adoption in the power industry. Readers interested in a storage options, make cloud technology an ideal option for particular cloud-based technology can look to a similar real- power system practitioners. world use case in Section V to learn how an Cloud technology has other benefits as well. The modern organization navigated the path for adopting a cloud-based power grid is a cyber-physical system that ties together the technology. physical grid infrastructure with the Information Technology/Operational Technology (IT/OT) infrastructure for II. BUSINESS DRIVERS FOR CLOUD ADOPTION reliable and resilient grid design and operation. It is desirable The advent of cloud computing has brought unprecedented for such safety-critical systems to have a fault-tolerant benefits to organizations in many business sectors, and the architecture to ensure operational continuity in the event of a power industry is no exception. Digital transformation of the disruption of IT/OT systems. Cloud computing, which is a electric grid is an important catalyst for cloud adoption. Broadly proven technology in several other industries such as finance, the business drivers for cloud technology in the power industry e-commerce, insurance and healthcare [3], is a fitting option for can be viewed from three different perspectives – resources, This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 3 solutions, and infrastructure. by grid modernization. For instance, the transmission operators A resources viewpoint: The modern power grid requires and distribution operators need a resilient IT infrastructure that flexible access to IT resources that are scalable and cost- offers abundant local and geographical redundancy to support effective. These resources include but are not limited to storage, operational continuity after blackouts and other extreme computing, and networking. Due to large-scale deployments of weather events like hurricanes, wildfires, earthquakes and new OT systems such as Phasor Measurement Unit (PMU), tsunamis. CSPs provides such resilience and redundancy by Advanced Metering Infrastructure (AMI), smart meters, and offering services from more than one geographic region, by industrial Internet of Things (IoT) devices, the data generated providing automation for monitoring, rapid reactions to such by the daily power system activities has exploded in the past devastating events, and elasticity to scale as needed. With cloud decade. As with other industries, this data is a “gold mine” for technology, power utilities can easily implement a fault-tolerant improved analysis, operation, predictive maintenance, architecture that can support their business continuity in the planning, monitoring and control; the need for which has event that their infrastructure experiences any disturbances. In become more imminent due to uncertainty and variability in addition, the 5G and IoT revolution has a deep impact on many operating patterns of the grid. These are primarily driven by the industries, including power systems. To unleash the power of proliferation of renewable energy and distributed resources, these cutting-edge technologies, power utilities need to respond risks associated with extreme weather events, ambitious goals to the business needs of OT systems such as SCADA and EMS to shift from fossil fuels to low-carbon technologies, and rapid while ensuring security, integration, visibility, control and electrification. As one would expect, the efficacy of these compatibility. Thus, power utilities need to carefully consider improved analyses depends on the storage, management and the right approach to bring 5G and IoT devices to the enterprise processing of large volumes of data collected in the grid OT workload. Cloud technology can simplify the integration of environment. How to efficiently manage and process such data these technologies and unify IT and OT systems to construct a and uncover the value behind them is a question faced by many converged system architecture. grid utilities today. Cloud computing is a practical and Fig. 1 shows a non-exhaustive list of power system business economic option for the power industry to acquire massively needs that the cloud services can support. The following scalable and elastic resources for data transmission, storage, passages use the application instances in Fig. 1 to show how processing and visualization. In only a few years CSPs they can benefit from cloud adoption. It covers a wide range of offerings have rapidly evolved; from compute technologies that power system entities including grid operators, distribution started with virtual machines to serverless computing, to utilities and market participants. From within the probable containers and enhanced orchestration of containers applications in Fig. 1 and the following paragraphs, we (Kubernetes). All the offerings enable a flexible way to acquire collected a dozen of real-world use cases and summarized their resources without a complex and time-consuming procurement adoption and experience in Section V from utility’s perspective. process (i.e. “pay-as-you-go” pricing strategy). Utility A. Planning studies companies only need to pay for the time period when they utilize the resources leased from the CSPs. As the power grid continues its modernization journey, its A solutions viewpoint: Cloud computing unlocks numerous network size and complexity is constantly increasing. There are new solution frameworks, advanced algorithms, and tools such more and more sophisticated interconnections, the addition of as data-driven approaches through data mining, Machine new devices like High Voltage Direct Current (HVDC) systems Learning (ML) and Artificial Intelligence (AI), in a highly and Flexible Alternating Current Transmission System accessible manner. These data-driven methods can be used to (FACTS) and large penetration of renewables and electrified develop online algorithms that had been traditionally difficult resources. These add many nonlinear, non-convex and ill- for model-based methods due to the lack of good model behaved characteristics to the grid analysis. To study these parameters. These algorithms have applications ranging from features accurately, a large set of discrete variables as well as control to cyber-security (anomaly detection) to weather differential and algebraic equations must be added to the grid forecasting. CSPs constantly update and add new technologies analytical model. Adding further to the complexity is the (e.g., ML/AI models) that can be immediately tested and used uncertainty and variability brought about by utility-scale by power utilities to enhance their data analysis capabilities and interconnected renewables, Distributed Energy Resources improve their business outcomes. These services based on (DERs) and microgrids that require modeling of many grid “pay-as-you-go” billing approach greatly lowers the barrier to parameters as random variables. Together these features have entry for power system users to utilize the ML methods and led to engineers having to analyze a far greater range of frameworks which typically have been highly demanding for scenarios to comprehensively evaluate the impact on the system hardware and software requirements. With cloud technology, from many different perspectives, including voltage, thermal power utilities can now take advantage of these advanced and stability. This requires an order of magnitude increase in algorithms and focus on their primary business needs, letting computational resources. the CSPs do the heavy lifting for them. Despite the need to manage fast-growing computational An infrastructure viewpoint: The IT/OT cyber infrastructure intensity and complexity, the computing resources provisioned of power system entities, whether on supply side or demand by power system organizations today not meeting the side, needs a revamp to better adapt to the challenges brought computational needs of emerging grid analyses and so-called This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 4 on-premises “supercomputers” or “computing clusters” are cost when idle. To meet the peak demand for computation while creating observed bottlenecks. These local computing resources simultaneously striking a balance between cost and efficiency, are unable to scale and inelastic. As the demand increases, they elasticity or scalability [4], which is defined as one of the key constrain the simulation efficiency and underutilize the sunk

Fig. 1. Mapping the business needs to cloud services. The hexagons in the center of the diagram represent the services/advantages of cloud, while the surrounding rectangles represent various business drivers. Numbered and colored circles are attached to distinct rectangles, indicating which business need will primarily benefit from (or rely on) what cloud services/advantages. For example, planning studies mainly take advantage of elastic computing and easy data sharing feature of cloud. features of cloud computing by NIST in [5], is needed circumstances, e.g., network outages and cloud service outages. imminently by grid operators, transmission owners and vertically integrated utilities. Elastic computing embodies not only high-performance computing and parallel computing, but also allows the users to rapidly upscale/downscale the resources as needed without worrying about the capacity planning. Besides, it eliminates the lengthy waiting time from the procurement to the deployment in an on-premises environment. With easy access to unlimited cloud resources, power system engineers can perform large-scale simulations more cost- effectively. Such supersized studies come in a variety of forms: steady state analyses, such as transmission needs assessment, installed capacity requirement study and tie benefit study [4], or dynamic studies such as transient stability assessment and cascading analysis. Moreover, with access to elastic cloud computing engineers can solve previously unsolvable time- constrained problems, for instance, generating units delist study during forward capacity auctions. Fig. 2. Using on-premises and cloud computing to meet different portions of It is worth noting that elastic computing and on-premises demand: the computing demand of a utility can be typically divided into two high performance computing are not mutually exclusive. On the parts: fixed demand and variable demand. The former indicates the predictable, contrary, they complement each other. As shown in Fig. 2, planned workload, whereas the latter refers to the bursting, unplanned workload. while the cloud resources are the best choice for meeting the variable part of the demand curve considering its rapid B. Storage and Management of Massive System Data elasticity, the on-premises computing resources are sufficient The demand for massive data storage and management has for the fixed portion of demand as predictable workload can be been gradually increasing within power system entities, even met by the resources provisioned in-advance. Moreover, the on- before the so-called “era of big data” arrived. Generating premises resources can also complement the cloud as a local purchase plans for data storage media has been one of the backup of cloud resources during certain exigent critical annual tasks undertaken by the IT department in most This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 5 power system companies since pre-big-data times and more so Before cloud became available, most AI/ML workload was now. Being the outposts for receiving user-end data, power isolated and cost-prohibitive due to data storage and processing utilities are experiencing an annual rate of intake that is requirements of the algorithms; but the economics of the cloud exponentially increasing [6]. The growth of these user- have now enabled ML capabilities in a highly accessible generated data is primarily driven by emerging technologies manner. Moreover, they have unlocked the ability to constantly such as AMI smart meters and IoT devices. In the meantime, switch to evolving algorithms without having to worry about the deployment of PMU devices across the interconnected bulk constant upgrades to hardware and licensing. By aggregating power system has also contributed significantly to the data in one-place and addressing the needs for these data-driven tremendous growth in data intake because of high sampling rate approaches, the cloud is opening doors to more real-time and increasing penetration. analysis for grid operation. With the exponentially growing data intake, the power D. Wide-area Monitoring and Situational Awareness industry is seeking a high speed, low-cost and scalable data storage approach. This is a need that many industries share No nation-wide monitoring infrastructure for the bulk power today and the power industry is no exception. Cloud technology grid exists today. Independent of the location whether North is an ideal solution for data storage of this scale, where power America, Europe or Asia, each region is managed by a regional utilities no longer would have to worry about over-provisioning system operator, which coordinates with sublevel local control or under-provisioning storage. Together with this dynamic and centers to ensure reliable monitoring and control of the grid. almost “unlimited” access to data storage, cloud offers the These regional coordinators and transmission companies have underpinning for seamlessly integrated advanced data their own software solutions for data acquisition, orchestration, management. With these data-related resources and services, IT and visualization. Data sharing occurs across regional administrators and engineers can organize, protect, process and “borders” to the extent necessary to accomplish basic validate the grid data more conveniently and effectively. coordination, but on a need-to-know basis defined by narrowly scoped peering agreements [13]. As the bulk power grid C. Data-driven Modeling and Analytics evolves in response to changes in electricity generation and As the amount of data generation and intake grows, power consumption patterns, broader system monitoring and control system companies, just like many other industries, are looking is required. Such wide-area monitoring and situation awareness to advanced algorithms and approaches to uncover the value is becoming increasingly vital to enable early warnings of behind the data they own [6]. The data-driven methods, potential risks, expedite coordinated control and problem especially ML and AI techniques, offer a good approach to solving. achieve these goals. Cloud technology offers an ideal environment for Although rarely mentioned, efforts to apply traditional multilateral data and results exchange because of its massive AI/ML methods in power systems have been made since a few network interconnectivity based on dedicated CSP networks decades ago. For instance, Artificial Neural Network (ANN) and the public Internet. The downstream applications that had been widely studied in the 1990s and was finally applied to consume such data or results can also be hosted on the cloud to load forecasting [7]. However, it encountered significant provide a shared online solution to all collaborators challenges in other applications due to limitations with both independent of spatial or size constraints. Within a decade of software and hardware resources to process large dimensions of development, cloud technology has incorporated many features input. The required time for model training and validation was to support massive real-time data streaming, cross- hence prolonged. Another instance of data-driven approach is account/cross-region data sharing and fast content delivery the use of expert systems way back in the 1980s to 1990s [8]. network, making itself the most efficient and economical Dedicated research efforts were made during those times but solution for wide-area grid monitoring. These features, ended prematurely due to the complexity of rules and the available through a variety of services by CSPs, can provide a constant need to update the rules according to the changes in consistent, secure and scalable services to the power system system condition. entities for data sharing and collaboration. Thanks to the renaissance of AI/ML research and access to E. Real-time (backup) Operations and Control much more powerful computer hardware today, the industry is ready to revisit the data-driven approaches to power systems Power grid operators are obliged to keep their essential and re-evaluate the possibility of bringing the power of AI and businesses running even during those unusual times when the ML to power systems. The idea is strengthened in parallel with control center facilities are unavailable or inaccessible. For continuous investment in communication and metering example, during the outbreak of COVID-19 the grid operators infrastructures that allow power utilities and grid operators to had to run the grid reliably and securely even with an increased obtain data at high speed and low cost. Some emerging data- likelihood of operation disruption. During these times, driven use cases include behind the meter (BTM) solar operators lacked timely support from IT and operation generation prediction [9], anomaly detection [10], cyber- engineers as they were required to work from home and many security-based intrusion detection systems [11], Locational essential applications for generation dispatch did not allow Marginal Price (LMP) forecast and autonomous voltage control remote access [14] increasing the challenges faced by grid [12]. operators. How to increase the fault tolerance of essential controls to keep the lights on during such circumstances is a This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 6 new and emerging challenge for system operators. from external sources such as weather forecast and asset With the aid of cloud computing technology, some of the management system, the utility companies can perform Bulk Electric System (BES) reliability services can be securely complex analytics to enhance the customer experience as well backed up through the CSPs which offer far more infrastructure as their operational efficiency. Cloud computing technology, scalability and resilience and individual power entities. undoubtedly, will facilitate this process because it simplifies the Whenever the local control facilities, such as software setup of a centralized repository to accumulate and store data applications, communication networks and physical servers are from heterogeneous sources, also known as “data lake” [15]. A unavailable or inaccessible, the backup control system on the data lake can include structured data from relational databases cloud can be manually activated to take over crucial operation (rows and columns), semi-structured data such as CSV, JSON tasks to maintain business continuity. Since the cloud-hosted and more, unstructured data (documents, etc.) and binary data control system can be accessed from anywhere by means of such as images or video that are uploaded by consumers. In secure connections and validated identities, the operators are addition, cloud-based data lakes offer benefits such as auto- more likely to continue performing their job function more scaling as well as providing utility companies and their comfortably and reliably during abnormal times so that the customers a unified entry to access their heterogeneous data in operation continuity is guaranteed. a centralized location. In contrast, traditional on-premises One of the advantages of adopting cloud technology to back solutions often result in silos of information across different up fundamental grid controls is the availability of higher level systems, which requires additional integration efforts to use the of fault tolerance, which can be achieved via redundant and data. Furthermore, cloud technology makes it easy to create, fault-tolerant architectures. Unlike traditional control center configure, and implement scalable mobile applications that configurations (i.e., main/backup) that are built independently seamlessly integrate them with backend systems where but are located geographically close, the cloud workload on advanced analytics on heterogeneous data is performed. which the control center is implemented can be hosted in G. Advanced Distribution Management various data centers from across a larger geographical footprint, reducing the likelihood of a single fault affecting operations. Power & Utilities is known as a legacy industry, with a lot Even if a catastrophic event were to impact multiple cloud data of devices, processes and computations that have not yet taken centers in a specific region, these functions can be quickly full advantage of the recent advances in power electronics and switched or redeployed to alternative data centers in a different software engineering. However, there is a growing number of region that are geographically distant from the affected one. IoT and/or power electronic devices and sensors in the form of Many cloud providers now offer their own Domain Name measurement and control points spread across the electric grid. Server (DNS) service which makes a regional failover quick A large number of these new sensors and devices are installed and imperceptible to the users. A region-wide extreme weather in the distribution grid, near the end-consumer, requiring a event is a good example to illustrate this point. In the event of a scalable, secure, reliable, and intelligent Advanced Distribution hurricane or a snowstorm that sweeps over the US Northeastern Management System (ADMS). area, neither of the two independent control centers can be ADMS provides numerous benefits. It improves situational exempt from the high risk of telecommunication or power awareness by providing a single view of system operations in supply disruption. In case of the disturbance to the IT or OT near-real-time, ranging from distribution substations to infrastructure, the backup control service hosted in the cloud customer premises, and other utility systems. The cloud is an would be the “spare tire”. Since the cloud workload may be optimal location to host ADMS as it provides secure integration hosted in a different region other than the Northeastern, e.g., options for IoT devices and sensors, as well as other sensors and through a data center in South California, the grid operator can utility systems. Cloud technology also supports highly scalable resume the system operation and control capability with ease data ingestion patterns: from real-time streaming of data to through a secure, encrypted tunnel to the cloud service batch ingestion of data from traditional systems. irrespective of the access location. By supporting various data ingestion patterns coupled with low cost, durable storage solutions, the cloud enables data F. Operational Efficiency and Customer Experience driven innovation, as opposed to capturing and storing only a Improvement subset of the data due to cost and computing resources A distribution utility is the link that connects consumers with constraints. The availability of more data from using cloud the bulk grid regardless of the source of electricity. Today’s technology allows all data to be analyzed in tandem with residential energy customers are asking for more from their physics-based models of the electric grid to automate electricity providers: more clean energy options, more identification of data anomalies, model inaccuracies, and gain information transparency, and higher Quality of Service (QoS). new insights into grid operation. By using horizontal and This new reality of increased consumer expectations and desire vertical scaling the cloud-based ADMS can efficiently analyze has led to electricity providers offering many new products and models with millions of system components. services beyond basic electricity delivery. For example, a user- Cloud-based ADMS can utilize serverless, extensible, and friendly, online power outage map with an integrated incident event-driven architectures to operate system reliably during reporting system is a fundamental need for today’s customers. unplanned events (e.g., faults, fault induced switching, changes With the data collected from AMI meters, user feedback in weather, measurement/control failures). These features also collected through the customer reporting system and the data This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 7 position cloud-based ADMS an attractive choice for future grid artificial intelligence, and data mining algorithms. operation that incorporates new elements such as large-scale I. Market Settlement Plug-in Electric Vehicles or PEV integration. More specifically, serverless architecture allows modular applications to be added Electricity is continuously generated and consumed on a 24- and updated over time to support event-driven workflows in hour clock, but settlement periods are defined as distinct time response to a rapidly evolving grid with increasing frequency frames, e.g., an hour in North America and 30 minutes in of unexpected scenarios mentioned above. Australia. With ever-growing integration of renewable resources and energy storage, the owners of resources are H. Distributed Energy Resources Management expecting to bid (offer) more frequently so that they can The modern electric grid is rapidly evolving with the most respond quickly to electronic dispatch instructions to realize prominent advancements being made on the low-voltage most of the market efficiency gains due to the aligned distribution end. In part, the evolution is driven by a sudden settlement of five-minute profiled MWh intervals and five- explosion of agile, smart, Internet connected, and low-cost minute energy and reserve pricing. In short, reducing the DERs. The phenomenon is underscored in recent industry settlement time blocks can better compensate market reports; one by Wood Mackenzie Ltd. [16] that estimates DERs participants for the real-time energy and reserve products they capacity in the U.S. to reach 387 gigawatts by 2025 and another are delivering. However, a shorter settlement interval places a from Australia [17] which predicts that 40% of the Australian higher requirement on computational and data management energy customers will use DERs by 2027. Alone these DERs capability of the market operators. By exploiting elastic are either low-power or control low-power equipment amount computing and big data infrastructure through the cloud, energy to a significant impact to the grid individually; however, once market operators can easily acquire the ability to process vast aggregated they can be of a significant impact on strengthening amounts of data in a timely manner while greatly reducing grid reliability or resiliency. They can provide much needed operating expenses for themselves as well as the market grid services such as demand charge reduction, power factor participants. correction, demand response and resiliency improvement. J. Collaborative System Modeling and Hybrid Simulation These unique facets of DERs have precipitated an avalanche of new products [18] [19] from various entities that enable The cloud can be used as a central hub for collaborative aggregation and monetization of DERs; many of which aim to model development and Transmission-and-Distribution (T&D) use aggregated resources in the bulk energy market to optimize co-simulation. The recent issuance of FERC Order 2222 opens for asset profitability. up a new opportunity for DERs to participate in the wholesale Cloud-based technologies are an ideal choice for these electricity market in North America, leading to an increase of products as they can unify control of various participating joint T&D network studies following a hybrid co-simulation DERs in one central location under the authority of a designated approach. These studies include but are not limited to: i) market participant. Such a unified cloud-based environment coupled T&D time-domain co-simulation combining significantly simplifies the management of participating DERs electromagnetic transient (EMT) simulation with through synchronous collection of data from various electromechanical dynamic simulation and ii) coupled T&D controllable devices and facilitates large-scale analytics that steady-state power flow and optimizations. In general, these otherwise could not be performed in remote locations. The combined T&D simulations can span up to hundreds of millions approach itself is more of a necessity than a choice due to of solution variables with data scattered across many geographical spacing of the DER devices, which are typically geographically separated locations. Solving analysis problems located with low computing resources. Nonetheless, almost all of this scale is not possible on a single compute node. While of these resources are connected to the Internet through IoT- traditional on-premises computing clusters with limited based sensors and hence enable low cost and low effort cloud resources for parallelism can facilitate such hybrid simulations, aggregation solutions via general API based products [20]. their efficiency, in terms of speed and robustness, is In the DERs connected grid, a single gigawatt resource is unacceptably low due to the involved complexity and high- replaced via hundreds of thousands of small DER resources. volume data exchange between the various entities. With access Optimal use of these DER resources, the DER aggregation and to cloud scalability and cloud elasticity, the co-simulation of dispatch algorithms require advanced computing resources as large T&D models can be fully accelerated. With the ability to the number of decision variables rises exponentially. Once scale the compute and memory resources at will, the industry again, the cloud’s elastic computing capabilities, which scales can even incorporate other factors such as weather, finance and up and down in tandem with needs, reduce or eliminate the need fuel constraints into modeling and simulation without having to for hefty financial investment in large inelastic on-premises worry about lack of computing and storage resources. One such infrastructure, data center infrastructure and personnel. cloud-based co-simulation approach has been validated (with Looking into the future, cloud technology for DER aggregation some proven efficacy) in a recent co-simulation platform and services is expected to expand significantly to HELICS [21] that has been deployed by a consortium of U.S. accommodate new-age resources (such as EV batteries, home- national labs. battery systems, and smart thermostats) and new-age grid Furthermore, cloud technology can also facilitate the analytics that will include state-of-the-art optimizations, collaboration of model development between the grid operators and distribution companies with its easy-to-access and easy-to- This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 8 share characteristics, which is particularly helpful when the Model (SRM) [22] [23]. In general, depending on which cloud personnel are geographically separated. The COVID-19 service model is adopted, the users bear different areas of pandemic has highlighted the heightened need for such responsibility for cloud security. collaborative effort. Moreover, since data security is sensitive For instance, in the Infrastructure-as-a-Service (IaaS) model, to power grids, the cloud can be the perfect solution when it the CSP is responsible for the security of the underlying comes to running distributed simulations or optimizations, infrastructure that supports the cloud (security of the cloud), wherein data or models are spread across multiple utilities. while users are responsible for anything that they store on the Cloud facilitates this by allowing various utilities to store and cloud or connect to the cloud (security in the cloud). Even if a protect their own data privately while exchanging minimum Software-as-a-Service (SaaS) solution is selected, users still common dataset with other entities to responsibly run the must consider the security of data in transit. In a nutshell, users distributed simulation and optimization studies. should always be aware of their responsibility when they are using a public cloud service. Also, not all SaaS solutions are K. Coordinated System Operation Drill directly provided by CSPs. Many are provided by Independent It is becoming increasingly difficult to operate the electric Software Vendors (ISVs) who may host their service on a grid with the increased penetration of DERs, higher instances CSP’s environment, or an on-premises data center operated by of extreme weather events, threats of cyber-physical attacks and the ISV. Users are not always aware of these differences and loss of know-how due to retiring workforce. Increasing drill therefore cannot always protect their data and systems participation by role playing the different jobs can help operate comprehensively. the power grid more reliably during such events. However, such Guidance for cloud security is discussed in Section IV. A. drills must be done in a simulated environment as these events 2) of the paper. cannot be practically realized in real-life without impacting the 2) Service Reliability Related Challenges day-to-day operations of the grid. With access to a realistic In addition to the security responsibility, service reliability simulation environment, multiple participants can is another concern for power utilities when using the cloud. collaboratively discuss and analyze extreme events. They can Service reliability comes in many flavors, which are further replay, rewind, or reload any prior or future scenario with no explained below: restriction on time or frequency of the analysis. a) Network Latency The advances in cloud computing have provided the Communication over a network always involves latency. opportunity to develop a simulation environment that enables However, high latencies lower the customer satisfaction and real-time coordination among multiple participants, including adversely affect the business services that are hosted remotely. reliability coordinators, balancing authorities, transmission For utility users, connecting to a cloud service over a network operators, generator operators and distribution operators. A with high latency may result in erroneous decision making, cloud-based simulator can help bridge the gaps among planners, issuance of inaccurate control order or even system engineers, managers, system operators and cyber defenders. malfunction. Using one real-time tool, they can all observe and react to b) Network Connection Disruption system behaviors such as exceedances of MW transfers limits, Network connectivity, not necessarily Internet, is essential extreme voltage violations and large operating angles. for a consumer to access cloud services. Power utilities have a choice between public Internet and specialized III. CHALLENGES OF CLOUD ADOPTION IN POWER INDUSTRY telecommunication services that cloud providers offer in Within this section, we discuss the challenges of cloud conjunction with telco companies (i.e., private networks). adoption from the perspective of a) power grid users and b) While private dedicated connections provide much higher software vendors. We provide a brief discussion on compliance reliability, they are susceptible to network outages like any related challenges for cloud adoption as well. We provide other network. How to handle the occasional occurrence of a corresponding guidance to overcome these challenges in the network disruption event to alleviate any likely impact of next section. network outages on grid workloads is another challenge faced by power grid utilities. A. Challenges for Power Grid Operators and Utilities c) Cloud Service Outage The reasons that make average power utilities and grid Another outage that is even rarer than the network outage is operators hesitate to adopt the cloud technology vary from the cloud service outage. Unlike the on-premises solutions that organization to organization, but according to a survey customers can continuously rely on even when the vendors have conducted by our task force, the top three concerns for the stopped providing support or have closed the business, users on power utilities and grid operators are cloud security, service public cloud depend on near-real-time provisioning of services reliability and cost. by providers. Any duration of cloud service outage will lead to 1) Cloud security and responsibilities in the cloud an interruption of the user’s business process. Cloud security is a key challenge to cloud adoption. Equally 3) Cost Challenges critical is the question of who is responsible for this security? As shown in Fig. 3, in reality the cloud security responsibilities Another hurdle for utility companies to migrate their are split between the power system users and the CSP. Such a application portfolio to the cloud is the financial categorization responsibility division pattern is called Shared Responsibility of cloud expense. While the purchase of infrastructure such as This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 9 in-house servers and their peripherals is considered as a capital operate on distributed memory resources. Section IV. B. 1) investment and these can be recovered from utility’s rate base c) provides guidance on adopting software for enabling high- [24], paying for the cloud services is currently regarded as an performance computing on cloud. expense that falls into the category of Operation and d) Challenges with stateful software Maintenance (O&M) cost. The O&M cost is not recoverable by Most software in the power industry today is stateful. In a the utility companies. nutshell, whether an application is stateless or stateful depends Guidance in the form of solutions or mitigation measures for on where the “states”, such as user profile information and the above-stated concerns related to service reliability and cost client sessions, are stored. While stateful applications maintain of cloud application are provided in Section IV. A. 3) and the state data on the server itself, stateless applications put these Section IV. A. 4) respectively. data on a remote platform. Stateful applications are known to have superior performance because they don’t need to process B. Challenges for Software Vendors of Cloud Applications as much data in each client request in comparison to stateless To practitioners in the electric energy sector, cloud programs but being “stateful” has challenges when it comes to computing is an emerging technology and therefore software working on the cloud. First, it restricts the scalability of vendors at the frontier of providing cloud services are likely to resources because application states are stored on the server face challenges as well. These key challenges from a software itself. Replicating these states to newly launched servers in vendor’s perspective can be categorized into the following: the response to varying demand increases the processing overhead software design pattern, the licensing mechanism, and the and thus lowers the application performance. Furthermore, the pricing model. users of such an application need to continue sending requests 1) Software Design Challenges to the same system that maintains their state data; otherwise, We consider the key design challenges for cloud adoption they will lose historical context. As a result, the users may below: experience delays and shutdowns as traffic rises to the degree a) Monolithic architecture that the server cannot handle. Second, it reduces visibility of Most applications running in production systems today have interactions, since a monitor would have to know the complete monolithic architecture. Often, these can be ill-suited for cloud state of the server. Therefore, stateless programs should be adoption, which has led to many companies realizing that considered for cloud-based software. More guidance for this is simply moving their legacy system to the cloud either brings provided in Section IV. B. 1) d) of the paper. them marginal benefits or even worse, negative benefits with 2) Software Licensing Mechanisms and Challenges unexpected problems. According to [25], challenges Commercial software applications are licensed in several with monolithic architectures are four-fold: 1) fault isolation ways. The licensing methods differ in the management cannot be contained; 2) they are hard to scale; 3) deployments interface, process, allocation and availability. The power are cumbersome and time consuming; 4) and they require a system software vendors, just like their peers in other business long-term commitment to a particular technology stack. domains, usually adopt all well-known methods to license their Therefore, for cloud-based software there is a need for products. The most common licensing methods include dongle evaluating other software architectures. Section IV. B. license, node-locked license or single use license, Bring Your provides guidance on software architecture for cloud. Own License (BYOL) and network license (also called floating b) Software portability to cloud or concurrent license) Software portability is another challenge with cloud To understand the licensing mechanisms and their suitability infrastructure. Most grid software today is compiled for a for cloud adoption, let us consider the software programs for specific operating system and depends on routines for a specific transmission study as an example. TABLE I summarizes the hardware and set of libraries. Hosting these tools on cloud can licensing options available through a variety of the most be time-consuming and tedious as these would require re-build of software based on the choice of operating system, system popular power system software tools used for transmission libraries, and hardware. Therefore, advanced techniques (such operation and planning study in the industry. It is observed that as containerization etc.) must be considered by software most of them support migrating the user’s software-dependent architects for easy portability to the cloud. Section IV. B. 1) workloads to the cloud, either by offering a cloud-compatible b) provides guidance on managing software portability for licensing option or through a separate version of the product. cloud. As of when this report was written, PLEXOS and PSS/E (still c) Inefficient scalable and parallelizable code and in the test phase) offers a subscription-based license option to algorithms support cloud adoption. MARS and MAPS are allowed to run Existing power systems software tools are generally not in the cloud if the license file is uploaded to the same host as designed for high performance computing and are often written well. In this case, the license validity time depends on the to work on desktop computer using centralized memory and contract between GE and the customer. TARA also works well single compute cores. Therefore, without modifications, most with cloud, but its license needs to be updated every so often of these tools will not take advantage of many cloud features because the cloud borrowed license is temporary. The other when hosted “as is” in the cloud. For example, to succeed in tools currently have limited support or lend no support for use high-performance computing applications, software will have in the cloud. For example, DSATools uses broadcast approach to adapt to algorithms that are easily parallelizable and can to search the working servers, while broadcast and multicast are This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 10

Fig. 3. Shared responsibility model (source:aws.amazon.com); The orange section represents the underlying infrastructure, which is the CSP’s responsibility. The layer on top of the underlying infrastructure stands for the users’ responsibility

TABLE I A COMPARISON OF LICENSING OPTIONS BY SEVERAL COMMON SOFTWARE PROGRAMS FOR TRANSMISSION STUDY (Blank indicates “not sure” this specific licensing option is being supported or not) Node-locked Subscription Dongle License BYOL Network License Suitable for Cloud License Licensing EnergyExemplar √ (PLEXOS √ Yes PLEXOS Connect) Siemens PTI √ √ (CodeMeter) Yes PSS/E PowerGem TARA √ short-term Yes, duration GE MARS/MAPS √ depends PowerTech √ √ Only on single VM DSATools Mathworks √ (MATLAB in √ √ Yes MATLAB the Cloud) PowerWorld √ No ABB GridView √ √ (cloud version) Yes GE PSLF √ short-term typically not supported by major cloud providers. For the sake environment. It lacks the flexibility to enable users to rapidly of this, DSA servers are not able to do horizontal scaling (scale launch an experimental or non-production workload in the out, i.e., dynamically adding more virtual machines). They can cloud, e.g., for agile development or to test a new workflow. only be scaled up by requesting to put more resources For a utility who wants to run a production workload in the (CPU/RAM/DISK) on the individual virtual machine Software cloud, the most likely outcome today is that they end up paying Pricing Model. Therefore, it can be concluded that while two sets of licenses – one set for cloud use and another for on- different software providers have many licensing mechanisms premises use. Given a fixed range of budget, it would be a waste for usage in the cloud, few of them are designed for effortless of on-campus licenses if they made a pessimistic estimate for cloud use. Guidance for licensing mechanisms that are most the cloud usage (buying too few cloud licenses). Or suitable for cloud is given in Section IV. B. 2) . alternatively, they would constrain the on-premises workflow if 3) Software Pricing Models and Challenges the cloud license was over-provisioned. Therefore, other Software pricing models for a product vary from vendor to pricing mechanisms must be considered for cloud software. vendor, but at present they can be categorized into two primary For guidance on how to make cloud licensing and pricing categories: pricing by license and pricing by subscription options better adapt to cloud environments for power grids (usage). The former approach is what the majority of power users, see Section IV. B. 2) . system software vendors are providing. In this pricing model, C. Compliance Related Challenges the price comprises of one-time provisioning fee and a periodical maintenance fee in terms of payment cycle. This In addition to technical and financial concerns, power pricing model can be further augmented to include additional utilities are hesitant about cloud adoption for regulated cost for add-on features in addition to the basic module. Some workloads due to lack of clarity for compliance obligations vendors will even charge for the concurrent use and/or off- associated with using cloud technology to run either the campus use. No matter how the fees are broken down, these are mission-critical workloads or the non-critical workloads. significant costs to the utility users. Critical workloads are described as any service or This kind of traditional pricing model is one of the factors functionality that supports the continuous, safe and reliable that hinders many utility companies from utilizing the cloud operation of bulk power systems. In North America, critical technology voluntarily, because they will face a huge amount workloads are mandated to comply with a dedicated set of of additional cost on software licensing if they move their standards, known as NERC CIP [26]. The currently effective business solutions built upon these tools to the cloud. In a NERC CIP requirements, at the time of writing, are silent on nutshell, the pay-by-license pricing is unsuitable for the cloud use of cloud technology and their device-centric structure raises This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 11 questions on how to demonstrate compliance to the data center, or it can be hosted by a third-party service provider. requirements. For example, there is heavy emphasis in the In a private cloud, the services are always maintained on a current definition on physical assets within the Electronic private network and the infrastructure, hardware and software Security Perimeter (e.g., the very specific term “in those belong exclusively to the organization. A private cloud makes devices” referring to BES Cyber Assets. A BES Cyber Asset is it easier for an organization to customize its resources to meet defined as A Cyber Asset that if rendered unavailable, specific IT requirements compared to other cloud types. degraded, or misused would, within 15 minutes of its required Historically, private clouds were often used by organizations operation, misoperation, or non-operation, adversely impact with business-critical operations seeking enhanced control over one or more facilities, systems. The key cloud concepts such as their environment, such as government agencies, financial virtualization, logical isolation, and multi-tenancy are not institutions and healthcare companies. But in recent years, the provisioned in the standards or in the current definitions. creation of cloud friendly compliance requirements such as For the non-critical applications, such as use of BES Cyber FedRAMP for federal government, HIPAA and HITRUST CSF System Information (BCSI), can be done in the cloud. NERC for healthcare industry, PCI and SEC Rule 17a-4(f) for financial is on record stating that Entities have BCSI in the cloud and are services [28] [29]have opened the door to the adoption of public doing so compliantly. Additional guidance on how to use cloud cloud by critical sectors of the economy. securely, reliably and compliantly are in early stages. In June Public clouds are the most common type of cloud computing 2020, NERC released a guideline on BCSI and cloud solutions, deployment. The resources leased by users, including all which is a start, but of limited scope as its main focus on the hardware, software, and other supporting infrastructure for encryption of BCSI. Implementation Guidance on cloud computing, storage, networking, etc., are owned and operated encryption and BCSI is pending NERC endorsement. by CSPs and delivered over the Internet. The major public cloud Addressing compliance challenges requires the providers are Amazon Web Services, Azure, Oracle collaboration among multiple parties, including the regulatory Cloud, Google Cloud, , IBM, RedHat, Alibaba, and bodies, utilities, CSPs, ISVs, third party auditors. A general Tencent. guidance on how to deal with compliance for cloud applications A hybrid cloud is a combination of on-premises is given in Section IV. C. infrastructure—or a private cloud—with one or more public cloud services. Hybrid clouds allow data and applications to IV. CLOUD ADOPTION GUIDANCE move between the two environments. Many organizations choose a hybrid cloud approach due to business imperatives A. Guidance for Power Grid Operators and Utilities such as meeting regulatory and data sovereignty requirements, The five pillars of cloud adoption are operational excellence, reducing the network latencies, taking full advantage of on- reliability, security, performance and cost optimization [27]. premises technology investment while simultaneously keeping Cloud solutions provide great technological capabilities and the ability to scale to the public cloud and pay for extra benefits when it comes to any of these columns. However, computing power only when needed. utilities should never rush to adopt cloud-based solutions The characteristics of these cloud types are compared in the without evaluating all the implications. This decision to migrate table below. Users should choose the cloud type by aligning to the cloud should only be made after due diligence with strong their goals with the characteristics below. business drivers identified. TABLE II 1) Choice of Cloud Type, Service Model and Provider CHARACTERISTICS OF DIFFERENT CLOUD TYPES Cloud Type Cloud computing can be categorized into public, private, and Private Public Cloud Hybrid Cloud Cloud hybrid clouds based on its ownership. From the service model Characteristics perspective, cloud computing is typically delivered to Control High Low High customers in terms of Infrastructure-as-a-Service (IaaS), Flexibility Low High High Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS). In recent years, other service models such as Container- Scalability Low High Medium as-a-Service (CaaS) and Function-as-a-Service (FaaS) are also Reliability Low High Medium emerging as the new attractive cloud offerings. What cloud type Cost High Low Medium to choose? Which service model is the most suitable for a Maintenance effort High Low Medium business user, particularly a utility company? The answers vary On-prem workload Low High Medium from case to case depending on the IT budget, business migration effort characteristics and value proposition of the company. a) Selection of Cloud Resource Model b) Selection of Cloud Service Model There is no single cloud computing type that is applicable Few cloud service models exist, and they differ on how the for all needs. Companies should analyze the advantages and responsibilities are shared between the service provider and the disadvantages of each cloud type and align their goal with these customer. The more responsibility the customer shifts to the characteristics before making the decision. service provider, the more convenience they will enjoy, but on A private cloud consists of cloud computing resources that the other hand, the less control and transparency they will have are used solely by one enterprise or organization. The private over the cloud offering. Due to the page limit, we only provide cloud can be physically located at an organization’s on-campus guidance for the best fit scenarios for the power industry: the This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 12

IaaS, PaaS and SaaS models. rebranded their business and now refer to themselves as cloud SaaS - While the SaaS model gives the least control over the providers, and then there are CSPs who offer hyper scalability software and the underlying services, it also provides and a wide range of services across the world. Even with the significant benefits on the other fronts. SaaS is affordable as it latter it may not be obvious to the reader of this paper as to eliminates the costs involved with purchase, installation, which cloud provider is best or ideal fit for their needs. It is not maintenance, and upgrades of computing hardware. With SaaS, the intent of this paper to promote one cloud provider over the services can be accessed from any device such as company another. laptops, smartphones, and tablets, which eliminates the Utilities should always research and investigate a cloud constraints set by on-premises software. In this way, SaaS provider before using their services. One approach to identify fortifies intra-company and inter-company collaboration reliable cloud providers is to look at the certifications, through its easy-to-access and easy-to-share features. accreditations, and regulatory controls that a cloud provider has Furthermore, the service in SaaS mode can become functional earned or demonstrates. For example, CSPs who meet in no time. All it takes is that you sign up for it. TABLE III FedRAMP [30] moderate and high requirements will be much summarizes a few applications where the SaaS model is more secure than those who do not. FedRAMP requires appropriate. Other similar use-cases should also consider the accreditation (before being functional) for each individual SaaS model. cloud service a CSP offers. Each accreditation can take a year IaaS - IaaS is the most flexible service model that gives the or more to obtain and CSPs are routinely audited to demonstrate best control over the hardware infrastructure such as managing their compliances with FedRAMP and other regulatory and identity and access, customizing guest operating systems and compliance requirements and standards like PCI, HIPAA, SOC upper-level applications according to the users’ requirements. 1, 2, 3, and ISO 27001 [28] [29]. While these accreditations Deploying an IaaS cloud model eliminates the need to deploy have no direct impact to the power industry that meets NERC on-premises hardware, which reduces the costs. IaaS CIP requirements for compliance, they demonstrate that CSPs streamlines the deployment of servers, storage, and networking are well positioned to provide the reliability and security to make it run in no time. Moreover, as the most flexible cloud required by critical infrastructures such as the power industry. computing mode, IaaS allows the users to scale the resources More discussions are given in Section IV. C. 1) . up and down to build their cloud solutions based on their needs. 2) Managing Security IaaS will be the best fit for the scenarios when users make their In general, the scale of public cloud service providers allows own decisions on what applications to run in the cloud and significantly more investment in security controls, policing and require porting licenses from on-site systems. Some examples countermeasures that almost any large company could afford are shown in TABLE III. themselves. This is even more pronounced for smaller power PaaS - Compared to IaaS and SaaS, PaaS is the middle layer utilities. However, security is a shared responsibility and cloud where you can offload most of the work to the provider and fill users still have to invest in security controls applicable to their in the gaps as needed. PaaS reduces the development time since roles. Based on the Shared Responsibility Model (SRM) shown the vendor provides all the essential computing and networking in Fig. 3, users are still responsible for “security in the cloud”, resources, which simplifies the process and improves the focus which means controls to secure anything that the user puts in of the development team. PaaS offers support for multiple the cloud or connects to the cloud. There is no “hands free” programming languages, which a software development mode when it comes to cyber security controls in cloud company can utilize to build applications for different projects. adoption. Power utilities must ensure that professionals are in With PaaS, your business can benefit from having a consistent place to deploy the right security policy for their cloud-hosted platform and unified procedure to work on, which will help services. Fig. 4 shows a comprehensive scheme to secure the integrate your team dispersed across various locations. cloud workloads. It depicts an organization of security control Few selected use cases that fit best into the PaaS model are in the cloud. The diagram groups the security control domains listed in TABLE III as well. at a high level. Again, this is a shared responsibility, so cloud users, ISVs and the CSPs will share responsible for the controls, TABLE III depending on which cloud service model the user adopts. While A LIST OF SCENARIOS THAT FIT THREE COMMON SERVICE MODELS SaaS PaaS IaaS users may not be directly responsible for security control, they Planning studies, are still strongly encouraged to ensure that the controls shown Modeling and Wide-area monitoring Simulation, Big data analytics, in the diagram are enforced by the vendors. We further discuss and situational Asset Management, Machine learning awareness, the key controls in more depth in the following paragraphs. Collaborative DevOps Power outage map a) Virtual Network Control operation and incident reporting A Virtual Network (VNet) or (VPC) is a network environment dedicated to the user’s account in the c) Selecting a Cloud Service Provider public cloud. This virtual network closely resembles a Cloud as a technology is not unique nor standardized. Cloud traditional network that one can operate in their own data is a generic description for variable compute or IT resources center, with the benefits of using the scalable infrastructure. A delivered on demand via the Internet. Many organizations use VNet in the context of Azure or a VPC in the context of AWS on premise virtualization and refer to their environment as a and Google Cloud logically isolates infrastructure by workload private cloud. Many co-hosting service providers have This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 13 or organizational entity. In the virtual network, the users can individual users for anyone who needs access to the account, launch the cloud resources to meet any business needs and have including the system administrators. Give each IAM user zero complete control over them. They can also create multiple permission at first while their account is being created; grant subnets that span across AZs to define different access rules. To them only necessary permissions to fulfill their job duties by achieve a higher level of security, users are recommended to request. separate the public-facing workload (e.g., web applications) Use roles to delegate permissions – Do not share credentials from backend servers by accommodating them in different between accounts to allow users from one account to access subnets. The public-facing workload should be deployed in a resources in another account. Instead, designate the users in public subnet to communicate with the Internet while leaving different accounts to assume different IAM roles for their the backend process in a private subnet so the inbound and access. Also, use role-based access for applications to access outbound traffic cannot be directly from/to the Internet. cloud resources. Unlike IAM users, roles are basically Sometimes the users want to extend their network into the cloud temporary credentials that are generated randomly and using CSP’s infrastructure without exposing the corporate automatically rotated for whoever assumes the designated role. network to Internet. In this scenario, the workloads can be Rotate credentials regularly – Enforce all users in the deployed in a private subnet of the virtual network and a site- account to change the passwords and access keys regularly. to-site IPSec VPN tunnel [31] can be established from their That way, a compromised password or access key that is used network to the cloud directly. Since there is no Internet gateway to access the resources in the account with the permission can to enable communication over the Internet, the likelihood of be limited. cyber-attacks and data breaches are significantly reduced. Remove unused credentials – Disable or simply remove IAM user credentials (passwords and access keys) that are not needed to protect your account against unapproved access. Integrate with existing identity providers – Most CSPs offer the ability to integrate their IAM with popular identity providers to allow identity federation. By doing so, power utilities can centralize access controls, improve efficiency, and maintain compliance of processes and procedures. c) Vulnerability Control While utilities pursue a rigorous vulnerability management in their on-premises systems, they may misplace their trust in cloud providers when it comes to vulnerability control in the cloud environment. However, the cloud providers are only responsible for securing the underlying infrastructure (the hardware and firmware). The customers are required to detect

Fig. 4. Comprehensive security control in the cloud and address vulnerability in their solutions on their own or through a trustworthy third party. The third party cloud includes b) Identity and Access Management a vendor who delivers the solution via the cloud or a partner of Identity and Access Management or IAM is critical to the cloud provider who offers professional vulnerability protect sensitive enterprise systems, assets and information scanning and patch management service. from unauthorized access or use. The best practices for IAM on Gaining visibility into vulnerabilities in the code is key to the cloud include but not limited to: reducing the attack surface and eliminating risk. Power utilities Use a strong password for account-level access – A strong usually rely on security tools to evaluate the potential risks or password is highly recommended by CSPs to help protect the deviations from the best practices in their applications. They account-level access to the cloud services console. Account should follow the same practices for the cloud-based solutions. administrators can define custom password policy to meet their There are quite a few security assessment services for different organization’s requirements. Typically, strong passwords cloud systems on the market and utilities should make use of should consist of a combination of uppercase and lowercase them. These services, which are updated regularly by security letters, numbers and special symbols, such as punctuation. The researchers and practitioners, can help improve the security and minimum length of password should be 8 characters or more. compliance of applications deployed in the cloud environment. Enable MFA – For extra security, MFA should be enforced It is more efficient and convenient for power system companies for all users in the account. With MFA, users have a device that who are adopting an IaaS-based solution to take advantage of generates a response to an authentication challenge. Both the such services. Management of the vulnerability can be also user's credentials and the device-generated response are done by transferring the responsibilities to a reliable vendor required to complete the sign-in process. If a user's password or especially in the case of PaaS and SaaS service models. Finally, access keys are compromised, the account resources are still it should be noted that it is the utilities’ responsibility to perform secure due to the additional authentication requirement. a regular review of what vulnerability controls have been Create individual users for anyone who needs access to the adopted by the vendors and ensure they have minimum cloud service – the root user’s credentials and access key for practices in place. programmatic requests should be locked away. Instead, create This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 14

d) Data Traffic Control that the software packages, the application dependencies and In a virtual network, cloud users can define fine-grained container images that are deployed on the cloud are free of rules to have complete control over the inbound and outbound malicious code. This is similar to the job function of the IT traffic of data. These control rules, defined by IPv4 or IPv6 department in utilities today wherein they sanitize the software Classless Inter-Domain Routing (CIDR) blocks, act as virtual that are used locally. firewalls for the users to allow or block access to the virtual 3) Service Reliability: Building Fault-tolerant Solutions machines, or the subnets. Users who adopt solutions that are Power system users should aim at building reliable, fault- built on top of IaaS services should check the user guide of their tolerant, and highly available systems when considering cloud- corresponding CSP to learn the details about the definition of native adoption or migrating on-premises workloads to the security rules for traffic control. cloud. A fault-tolerant architecture helps the user to ensure high e) Data Protection availability and operational continuity in the event of failures of In addition to the IAM setups that help protect data, e.g., some components. enforced MFA, user account separation from root account, the a) Mitigation of Network Latency Impact following approaches should be considered as well by the The following techniques can be used to alleviate the impact utilities to further protect the data when it is: i) at rest, ii) in of network delay on the business workloads: transit or iii) in use. Utilize CSP provided dedicated networking: Many CSPs Data at rest refers to data that is stored or archived on some work with telecommunication providers to offer dedicated media and is not actively moving across devices or networks. bandwidth private fiber connections to the CSPs region from Although data at rest is sometimes considered to be less the customer’s location. This allows for customers to have vulnerable than data in transit, power system users should never stable known performance with no risk of network latency due leave them unprotected in any storage (cloud or on- to general Internet traffic. AWS Direct Connect and Azure prem). Protection of data at rest aims to secure inactive data, ExpressRoute are examples of such services where a dedicated which remains in its state. Usually, encryption plays a major fiber connection is established between a utility’s data center role in data protection and is a popular tool for securing data at and the cloud. rest since it prevents data visibility in the event of its Shift data processing to the edge: For data collected by IoT unauthorized access or theft. The data owners are obligated to devices, they can be processed at the edge, which is known as know what encryption algorithms a cloud provider supports, edge computing, to further address the concern around latency what are their respective strengths or cracking difficulties, and by reducing the payload for transmission. what key management schemes they provide. The Leverage load-balancing: The users can distribute traffic recommended encryption method to secure data at rest is across multiple resources or services by means of load Advanced Encryption Standard (AES) with a key length of 256 balancing to allow the workload to take advantage of bit, i.e., AES-256. AES-256 is the strongest and most robust the elasticity that the cloud provides. encryption standard that is commercially available today. It is Choose workload’s location based on network requirements: practically unbreakable by brute force based on current Use appropriate cloud locations to reduce network latency or computing power. improve throughput. The users can select those edge locations Using CSP provided Key Management Services (KMS) is which have low delays to host the workload. When the edge another way to protect data. Such services integrate with other location becomes congested, use the local balancer to reroute CSP services to encrypt data without actual movement of key the traffic to other edge locations where the latency is low. material. A user can simply specify the ID of the key to be used. Distribute workload across multiple locations if needed. All actions using such keys are logged so there is a record by Optimize network configuration based on metrics: Use whom and when a key has been used. KMS systems also help collected and analyzed data to make informed decisions about with automated key rotation. optimizing your network configuration. Measure the impact of Likewise, data protection in transit is the protection of data those changes and use the impact measurements to make future while it is being moved from one location to another, decisions. For example, if the network bandwidth is low, it particularly in this context, when it is being transferred from a might serve well to increase the bandwidth. In addition, local storage device to a device or the other way consider options for higher quality bandwidth with less packet round. Wherever data is moving, effective data protection loss and provision for retransmission. Usually business-class measures for in transit data is critical as data is often considered Internet services and dedicated circuit networking can meet less secure while in motion. To protect data in transit, it is very these goals. common for enterprises to encrypt sensitive data prior to Critical workloads, especially those supporting real-time moving and/or use encrypted connections such as HTTPS and decision making and bulk power system operation, are more SFTP (secured by security certificates via SSL or its successor sensitive to network latency. Therefore, apart from the general TLS) to protect the contents of data in transit. network optimization approaches mentioned above, some Data in use refers to data in computer memory. It is usually additional countermeasures should be adopted if there is a considered to be the cloud provider’s responsibility to ensure likelihood of hosting critical workloads on the cloud. For the underlying hardware and OS are malware-free, based on the instance, an offline work mode should be provisioned in the SRM. However, it is still the user’s responsibility to confirm design of a cloud-based solution to allow the users to This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 15 temporarily work on their problems using manually fed data (collected via secure phone calls by operators). Utilities may also consider CSP provided hardware for such operations. b) Mitigation of Impact of Network Connection Disruption As long as the service is hosted in the cloud, theoretically a user can access it from anywhere as long as there is access to Internet connection (secured tunnel and identity authentication are usually required by utilities) or dedicated network connection. To achieve high resiliency, it is recommended to have an independent connection each at each of the multiple locations. For example, grid operators typically have two independent control centers – the primary one and the backup one. As shown in Fig. 5 (a), each of their data centers can (c) Edge to the cloud connection connect to the cloud through independent telecommunication Fig. 5. Increase service fault tolerance through network redundancy providers. Since an outage of one telecommunication provider Other than relying on the wired network to get connected usually does not occur concurrently with an outage of another with the cloud, utilities can also employ advanced wireless provider, this provides inherent resiliency during ISP outages. communication technology such as 4G LTE-A and 5G to Switching the connection from the down provider to the healthy connect edge devices directly to the cloud. As the IoT and edge one via another control center is a general fault-tolerant solution computing technology matures, the data collected by the edge to counteract the impact of network outage. devices can be locally processed, encrypted and then Enhanced resiliency can be achieved by separate transmitted to the cloud to feed the hosted services without connections terminating on separate devices at more than one having to use utility data centers as the data hub or relay. The location. As illustrated by Fig. 5 (b), even for the same impact of wired network disruption can thus be mitigated by corporate data center, duplicate connections are established to such “edge to cloud” connections as shown in Fig. 5 (c). provide resilience against device failure, connectivity failure, For those mission-critical applications or functionalities, it and complete location failure. is wise to avoid putting all eggs in one basket as modern society cannot bear sustained power outages. Automatic Generation Control (AGC) and Security-constrained Economic Dispatch (SCED) represent two such examples of critical workloads that continuously maintain the balance between power supply and demand. If these were to move to the cloud in the future to pursue certain extraordinary advantages (even though there is no such a need at this time), we must back up the process in local non-remote infrastructure even if independent, redundant network connections are established for cloud workloads. c) Mitigation of Cloud Service Outage Impact To build fault-tolerant applications in the cloud, it is important for the users (or the solution vendor) to understand the global infrastructure of the cloud provider first. For (a) Redundant connection from utility network to the cloud example, users should grasp the concept of Region and (Availability) Zone (AZ) that are offered by major CSPs. High availability of solutions can be achieved by deploying the applications to span across multiple AZs or more than one cloud region. For each application tier (web interface, application backend, and database), placing multiple, redundant instances in distinct availability zones creates a multi-site solution. To minimize the business disruption or operational discontinuity, the utilities going to the cloud are recommended to adopt a redundant and fault-tolerant architecture which allows them to quickly switch their workloads to a different AZ in case of a single-location outage. In the meanwhile, they need to consider duplicating the infrastructure of the running workloads in a different region so that desired resources and dependencies can be spun up rapidly to continue the service (b) Enhanced redundant connection from utility network to the cloud when there is a region-wide outage. As shown in Fig. 6, utilities can increase the fault tolerance of their cloud workloads and This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 16 gain disaster recovery capability through backups in multiple identify wastage of resources since it will further help them AZs and Regions. In addition, the users should also make their optimize their cost by shutting down unused or under-utilized disaster recovery plans, e.g., identifying critical data and resources. Various CSPs offer such services to help their performing cross-region backups for them, to ensure the customers identify these idle resources, such as AWS Trusted business continuity in the face of a man-made or natural Advisor and Azure Advisor. disaster. B. Guidance for Software Vendors 1) Software Design to Support Cloud Adoption To facilitate cloud adoption for the utilities, grid software must be architected and designed to be cloud friendly. Power system software vendors and in-house developers should consider the following while designing and developing cloud- based solutions: a) Microservices instead of monolithic architecture Due to challenges with monolithic architectures for cloud infrastructures, microservices architecture has emerged as a prime candidate for maximizing the benefits of moving systems to the cloud. It brings design principles to make software components loosely coupled and service oriented. Due to module independence, any service can be scaled and deployed separately in a microservices-based application. A basic principle of microservices is that each service manages its own data. Two services should not share data storage. Instead, each service is responsible for its own private data store [34], i.e.,

Binary Large Object (BLOB) as shown in Fig. 7. The Fig. 6. Increase service fault tolerance and disaster recovery capability through modularity introduced by this architecture also offers ease and backups in multiple AZs and Regions agility to keep up with the accelerated pace of business. Besides, the modular characteristic of microservices inherently 4) Guidance on Cost Optimization enhances security and fault isolation. Utilities today are hesitant about cloud adoption due to budget concerns, but the reality is that they can take control of cost and continuously optimize it on the cloud. With a meticulous analysis of their demand profiles, the utilities can take a few steps to quickly develop a plan that meets their financial needs while building secure, scalable and fault- tolerant solutions for their business needs [32] [33]. The first step is to choose the right pricing model. CSPs usually price the resources that users request based on the capacity inventory and the term of service. For example, in the context of an elastic computing service, users may see “Spot Instance (Virtual Machine)”, “On-demand Instance (Virtual Machine) and “Reserved Instance (Virtual Machine)”. While “On-demand” normally meets users’ needs, “Reserved” pricing provides a remarkable cost reduction on top of “On-demand” if the target workload is consistent over a period of time and users would like to sign a long-term agreement (1 year ~ 3 years) with Fig. 7. Comparison of monolithic architecture and microservices architecture the CSP. And “Spot” pricing, which works like an auction/bidding process for CSP’s surplus capacity after Although microservices architecture has many advantages, meeting the “On-demand” and “Reserved” needs, offers a it is not an omnipotent solution for cloud-native applications. further discount for the users to run stateless, fault-tolerant and While bringing agility and speed to software development, it flexible applications such as big data and planning studies. sacrifices the operational efficiency as there are many more The next step is to match capacity with demand. Cost can be parts/modules than the monolithic approach. The software optimized when resources are aptly sized. In order to line up the vendors should avoid overuse of the microservices layout for capacity with the real-time demand, the utilities, or their the development of cloud-native applications. For instance, a designated solution vendors should consider an autoscaling power flow analyzer or state estimator should not adopt a strategy to dynamically allocate the resources to match the microservice architecture because both processes are tightly performance needs. coupled and other processes such as network topology Moreover, users may also want to implement processes to processors can be invoked by both programs. Due to this, This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 17 coupled processes are more likely to have a bigger computation power industry, other types of databases such as NoSQL overhead when using a microservices architecture. However, a database (key-value store, document store, graph, etc.) and time-domain simulator for transient stability assessment can be distributed databases are beginning to play a bigger role as the orchestrated using microservices architecture since these data yield capacity and locations in the grid are growing. From analyses are independent processes given the same initial the programming point of view, new architectures should be condition and can be launched in parallel. adopted in the software design to support multi-thread and b) Use of containerization to improve software multi-core processing. For instance, one important factor which portability has a significant impact on the performance of parallel Containerization is a key technology that vendors may want computing is the coordination between submodules. Methods to consider for cloud-native application development. to efficiently distribute the software functionality across these Containers allow the developers to package an application and computing resources are expected. If these considerations are its dependencies into one succinct manifest that can be easily not properly addressed during design, applications meant to be ported to a virtual machine or cloud environment without run in multiprocessor, multi-core environments could end up having to rewrite the code. It significantly streamlines the with serious, hard-to-find performance issues [36]. Finally, for deployment process because the environments for an effective HPC software design, users should be given the development, test and deployment are made consistent in flexibility to opt in/out of the parallel processing and the choice containers to remove barriers between them. to choose the computing hardware (e.g., no of cores, RAM Furthermore, containerization enables developers to scale usage etc.). the applications by changing specific components or service d) Use of stateless protocol modules while keeping the remainders of software unchanged. To make an application more compatible with microservices For instance, in a container-based design, programmers can architecture, especially via cloud, a stateless protocol should be scale databases for network data or dynamic data and considered. In general, moving to be “stateless” protocol brings corresponding modules to support the increased processing load several major benefits, including 1) removing the overhead to without scaling web server resources when the size of study create/use sessions; 2) providing resiliency against system cases varies. By means of some container orchestration tools failures and recovery strategies in the event of failures; 3) like Kubernetes (k8s), the services can be scaled up or down allowing consistency across various applications; 4) scaling automatically based on resource utilization, which lines up with processing capacity up and down to handle variances in traffic. the characteristics of cloud-native applications. Determining the best protocol for cloud-native systems It is perceptible that container technology and microservices depends on the business model and the use case. For example, architecture are tightly coupled and usually adopted together in while a stateless application is ideal for short circuit fault the design of cloud-oriented applications, since containerization analysis, a stateful design will be more suitable for the forecast provides distinct, natural boundaries between different of system load or power flow on the tie-lines in the next few microservices. However, using containers does not imply that minutes/hours because the historical context matters. In the software follows the microservices architecture. Monolithic conclusion, a software vendor should neither be over-dependent applications can be containerized too. Although both on the traditional stateful design, nor abuse stateless models. At techniques are consistently used in application modernization, present, however, power system utilities overuse stateful vendors should carefully consider them for development based protocol today and lose on advantages of stateless protocol on the software characteristics such that the final goal of when using cloud. software portability is not hindered [35]. 2) Software Licensing and Pricing Suitable for Cloud c) Enabling HPC Capabilities The pricing models for cloud-native applications need to The software vendors, particularly those who provide cooperate with proper licensing models. In addition to the solutions for power system modeling, analysis and simulation, network license model and BYOL model, the vendors should should accommodate key technologies that enable high also consider licensing the software based on subscription. The performance computing when designing their software. More subscription-based licensing model, together with the pay-by- succinctly to unleash the power of HPC infrastructure, they subscription or pay-per-use pricing modes, are designed to be must employ expertise in parallel computing algorithms design more cost-effective and convenient in the cloud. In the pay-by- and programming. From the algorithm design perspective, the subscription pricing method, users only need to pay for the use vendors need to develop new domain specific parallel of software periodically based on the agreement between them computing algorithms for compute-intensive tasks and and the vendor. The billing cycle can be daily, monthly, memory-intensive tasks, for example, large scale optimal power quarterly or annually. Regardless of how the software is flow, unit commitment, transient stability, and utilized, the subscription fee is usually fixed, varying a little bit electromechanical simulation. They should also apply powerful depending on the selected payment option. The second parallel data processing methods in their applications when approach, pay-per-use, or rather, pay-as-you-go in the context there are massive data ingestion streams. Besides, new data of cloud computing, provides users a finer granularity of control storage techniques must also be considered for the seamless over spending on the software service. The vendor only bills the transfer of data between storage bins and compute units. users based on their usage (or time) in the cloud. From the point Although the relational database is still the mainstream in the of view of users, they no longer need to worry about whether This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 18 the cloud license will be overprovisioned or under-provisioned accreditations through external audits. because they only pay for the time when the service is being 2) Critical Workload and CIP Compliance used. For power system users in North America that are subject to Unfortunately, the subscription-based licensing and the two the CIP standards, it’s valuable to consider the security discussed pricing methods that rely on users’ subscription have objectives underpinning the CIP standards. Table IV shows a not been widely implemented by power system software mapping of NERC CIP standards to CSP security categories. vendors. The utility companies might be tolerable with the old- These categories organize the key capabilities to drive a cloud fashioned licensing and pricing options in the traditional IT user’s security culture. It also helps users to structure selection environment, but when it comes to the cloud, they would expect of security controls to meet security and compliance needs. a more flexible way for software licensing and charges. TABLE IV A MAPPING OF NERC CIP STANDARDS TO CSP SECURITY EPICS C. Guidance on Compliance for Cloud Applications NERC NERC Standard Corresponding CSP Security Compliance encompasses checks to confirm that a set of Standard Description Categories BES Cyber System specified requirements are implemented and are operating as CIP-002 [Customer Determination] Categorization expected. Such requirements can be set internally within the Security Management CIP-003 Governance organization to meet business, operational, or security Controls objectives. They can also be required by regulatory standards Identity and Access CIP-004 Personnel & Training such as the NERC CIP Standards. Management Identity and Access 1) Compliance Certifications and Authorization Electronic Security CIP-005 Management Perimeters Numerous, well-established security assurance programs Infrastructure Protection exist that standardize security assessment and authorization for Physical Security of BES CIP-006 Infrastructure Protection cloud products and services, such as FedRAMP, which is Cyber Systems enforced by the U.S. federal agencies. Such security assurance Systems Security Detection CIP-007 programs such as SOC, ISO, and FedRAMP, among others Management Infrastructure Protection require continuous monitoring and audit by cloud security Incident Reporting and CIP-008 Incident Response experts. Third party assurance reports support compliance Response Planning Recovery Plans for BES demonstration for security “of” the cloud. Such certifications CIP-009 Incident Response Cyber Systems provide security assurance to cloud users as well as support Configuration Change achieving certifications of their own. For example, entities CIP-010 Management and Infrastructure Protection seeking SOC-2 certification can inherit a CSP’s SOC-2 Vulnerability Assessments certification and focus their compliance efforts in fulfilling the CIP-011 Information Protection Data Protection Communications requirements applicable to their “in” the cloud applications. CIP-012 Data Protection Networks Fig. 8 shows how users, ISVs and CSPs collaborate on Data Protection Detection assuring the security compliance authorization. Identity and Access Supply Chain Risk CIP-013 Management Management Incident Response Infrastructure Protection

Cloud solutions can meet the security objectives in CIP standards by implementing these security controls. CSPs have well-documented user guides, technical papers, guidance documentation, and interactive engagement that support their customers in implementing security controls and achieving a high level of security. Security assurance certifications and authorizations present an opportunity for use in NERC CIP. A comparison between the FedRAMP Moderate control set and NERC CIP requirements reveals that the FedRAMP Moderate control baseline encompasses all NERC CIP requirements [37]. Fig. 8. Shared security assurance NERC’s acceptance of FedRAMP as CIP compliant for security “of” the cloud can provide a streamlined compliance approach In the model as shown in Fig. 8, customer is responsible for for power utilities and support efficient audit assessment while obtaining and maintaining certifications and accreditations maintaining stringent security obligations. Registered Entities through internal/external audits (as required) for their end-user building on FedRAMP authorized infrastructure and services system leveraging CSPs and vendors, while the vendor is could focus their security controls and compliance assessment responsible for obtaining and maintaining certifications and on security “in” the cloud. accreditations through external audits (as required) for their Regulators continue to learn more about cloud technology cloud offerings. For CSPs, they need to take responsibility for and explore their roles and expectations in auditing CIP-scoped obtaining and maintaining infrastructure certifications and This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 19 workloads. Engaging with NERC compliance and/or NERC benefits for each real-world use case. The use cases cover a Regional auditors is recommended to share insights on the wide variety of power system organizations (such as ISOs, value of cloud technologies to the reliability and security of utilities, and private vendors) as well as application areas (such power utilities and on cloud security, and to increase as planning, operation and market related). The table also compliance assurance. documents an optimal choice of cloud service model for each In addition to accepting third party assessments for “of the real-world use case. Finally, the table documents the benefits of cloud” compliance, a standard revision project is necessary to cloud technology for each use case as drawn from specific clarify Entities’ “in the cloud” requirements. A drafting team business needs described in Section II. We hope that the is currently working to address requirement language for on- following use cases can serve as a guidance for other cloud premises virtualization which began in 2016. For cloud adoptees in the electric energy sector. From these success technologies, a stand-alone standard can accommodate the stories, the reader can learn how the primary concerns over the security objectives most relevant to use of cloud and simplify cloud can be addressed or how the risk while migrating to cloud applicable definitions. technology can be mitigated. As these use cases are collected To accelerate cloud adoption for regulated workloads while through a participant survey, the details represents individual standards are under development, NERC can develop audit participant’s responses and may vary from use-case to use-case. guidance to clarify audit expectations and support auditor roles. A summary of the use cases is given in Table V and a more 3) Non-critical Workload Compliance comprehensive discussion follows with various use cases Non-critical workloads refer to applications or services that categorized based on their utility. are neither mission-critical nor high-impact. In North America, A. Cloud Applications in Grid Planning: such workloads are out of scope for the NERC CIP standards, but registered entities are still required to protect any Critical 1) Planning Studies Energy Infrastructure Information (CEII) data that is involved We first consider three use-cases of cloud for planning in cloud adoption pursuant to section 215A(d) of the Federal studies from New York ISO, ISO New-England and Omaha Power Act. The best approach to protect the confidential data Public Power District. and user privacies is to follow the CSP-endorsed best security New York ISO (NYISO) – practices. NYISO started using an on-premises cluster for system 4) Internal Compliance planning studies in 2012. As the hardware aged, and workload Power industry users often establish security expectations increased, it became necessary to procure more resources to and internal compliance obligations to align and confirm obtain results in a reasonable time. However, the cost of enterprise-wide objectives. Security Frameworks such as the replacing the entire on-premises system with up-to-date NIST Cyber Security Framework (CSF) [38] provide a holistic hardware was considered cost prohibitive. Instead, transitioning approach to security controls across networks. Entities can to the public cloud service was found to be more feasible since choose to meet compliance requirements from a variety of the balance of Total Cost of Ownership (TCO) spread out over security frameworks and/or regulatory standards, and map them the effective lifetime of that server until it is decommissioned to their implemented controls. is comparatively higher. Consequently, NYISO adopted a cloud solution in 2017, which phased into production in 2018. V. USE CASES FOR CLOUD TECHNOLOGY IN POWER INDUSTRY NYISO chose AWS as the cloud provider and opted for IaaS This section summarizes a collection of real-world use cases to build its cloud solution. The solution integrated Microsoft for cloud technology in the power industry. The summary is HPC as an instance manager and job dispatcher with AWS EC2 provided in Table V. The table breaks down each real-world use service to scale out the virtual machines via EC2 for HPC jobs. case in terms of the company, application area, cloud type and The architecture of this cloud-based HPC platform is illustrated service model, upfront and operating cost and maintain effort. in Fig. 9. It also summarizes the security control scheme and featured TABLE V A BRIEF SUMMARY OF REAL-WORLD USE CASES IN POWER INDUSTRY Case Company Application Cloud Cloud Upfront Operating Mainten Security Control Featured Benefits No. Area Type Service Cost Cost ance Scheme Model Effort Cloud IAM1,2, data resources encryption1,2, usage; security group1,2, Low cost; better NYISO1 Planning Public software enforced MFA2, scalability, much less task 1, 2 IaaS No Low ISO-NE2 studies (AWS) offsite and password rotation2, completion time and job concurrent role-based access2, waiting time users HTTPS/TLS license IAM Role, data Integrated process for data Cloud encryption, preparation, model Load Public 3 ISO-NE PaaS No resources Low enforced MFA, API building, training/tuning forecasting (AWS) usage activity logging , and testing to accelerate HTTPS /TLS ML development This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 20

VPN over SSH, Low cost; data sharing AES256 data with ease, high fault Wide-area Cloud ISO-NE/ Public encryption; separate tolerance, fast data 4 monitoring and IaaS No resources Medium NYPA (AWS) subnets with NAT recovery consistent data sharing usage instance display to enable real-time collaboration HTTPS/TLS for Provide a backup solution data in transit, SSE- to ensure operational Extremely KMS for data at continuity in case of an Backup control low Public rest, access, IAM emergency, eliminate 5 ISO-NE and emergency FaaS No considering Low (AWS) User/Role, API key, human errors in manual dispatch the event is signed URL dispatch, comprehensive rare security control, surprisingly low cost Role-based access, Cluster are highly scalable Cloud Anomaly Public MFA, data and cost-effective, cluster 6 ISO-NE PaaS No resources Low detection (AWS) encryption, configuration is easy and usage HTTPS/TLS the workload is low Account or OU- Ability to view/analyze based based access, both structured and Cloud data classification unstructured data, reduced Operational Existing resources and cybersecurity cost, easy and secure data efficiency and Public PaaS software usage for scrutiny sharing with any 3rd 7 PGE customer Low (AWS) IaaS license data lake parties, self-service experience cost and reporting, data improvement analytics traceability, agile development of ML models MFA8,9, identity based access8,9, data Cloud encryption8,9, resources activity logging8,9, usage for Much 3rd party testing and Shorter development 8 DER 8 data lower Centrica , Public IaaS monitoring of cycle, better system 8, 9 aggregation and No processing, than on- AutoGrid9 (AWS) PaaS8,9 vulnerabilities8,9, scalability, simpler management storage, premises VPNs on IPsec9, DevOps management analytics solution mutual TLS9, and intrusion detection9, networking Role-based API- level access control9 Low, Unknown at the Market settlement reduced Cloud Market Public PaaS work are time of writing from 30-min to 5-min 10 AEMO No resources settlement (Azure) CaaS automate blocks with massive data usage d management capability Access control11,12, Collaborative subscriptio Low, incident response11, Scalable infrastructure11,12, system SaaS11 n fee11, fully System and modeling consistency11,12, MISO11, Public 11, 12 modeling and PaaS12 No Cloud managed communication reduced IT effort11,12, NRECA12 (AWS) hybrid IaaS12 resources by protection11,12, integrated security, easy to simulation usage12 vendor encryption for data access users’ model12 at rest11,12 Firewalls, security groups, access Quick availability for IncSys/Po Coordinated Public Depends cloud Different control, intrusion customer, scalable and 13 system or SaaS on cloud resources levels of detection, logging fault-tolerant to support werData operation drill Private type usage effort and monitoring, mission critical drills, data encryption, Access from anywhere TLS

NYISO engineers to the HPC cluster head node on AWS via a file sharing service which allows on-premises servers to access cloud storage through a dedicated network connection: AWS Direct Connect. These jobs are then dispatched to multiple virtual machines spun up by EC2 service, which runs the job in parallel. The simulation results are sent back to the file system mounted to NYISO’s corporate network for users’ download. At the time of writing, NYISO’s cloud HPC platform can support planning studies in three applications: GE MAPS, GE Fig. 9. NYISO HPC Cloud Architecture MARS and PowerGEM TARA.

As Fig. 9 shows, the simulation jobs are submitted by NYISO uses data encryption and access control to keep its cloud operation secure. All virtual resources acquired from This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 21

AWS are encrypted by NYISO. End-to-end encryption is used power system simulations. The platforms can scale out and for data transfer between NYISO and AWS data centers. down the computing resources based on the demand and utilize NYISO also enforces strong credentials and high privileges to unlimited cloud storage. But ISO-NE’s cloud platform has access the sensitive data. The identities and accesses are some unique features as well, including but not limited to, controlled by AWS adepts via IAM service. capability to run instances in multiple AZs to increase fault With this cloud technology, NYISO has achieved significant tolerance, configurable instance type depending on the study’s cost savings. It has also achieved higher scalability to meet characteristics (if it is compute-intensive or memory-intensive), various computing profiles and a remarkable reduction in study failed job rescheduling, flexible instance purchasing options run-time and job queueing. Over 1200 compute nodes are (either Spot Instance or On-demand Instance) to strike a balance available to NYISO on cloud when a job is submitted to between cost and performance. NYISO’s HPC platform, and additional cores can be requested In years of use and enhancement, the platform has as needed. The job runtime of an exemplary task is 40 minutes demonstrated significant advantages over the traditional on- on the cloud compared to 12 hours on a local computer without premises cluster for many planning studies within ISO-NE’s any parallelization. workload, including transmission needs assessment and solution study, generation interconnection study, resource ISO New England (ISO-NE) – adequacy analysis (capacity requirements and tie benefits ISO-NE began using a local compute cluster to facilitate study), NPCC Bulk Power System (BPS) test, and Forward engineers with large-scale simulations in 2005. As was the case Capacity Auction (FCA) delist study [4]. A summary of these with NYISO, ISO-NE migrated to cloud for planning studies in studies on the platform is given in TABLE VI. 2013 and today is a pioneer of cloud adoption in industry. Like NYISO’s solution, ISO-NE’s elastic computing platform was TABLE VI also architected on an IaaS model using AWS EC2 service, PLANNING STUDIES THAT RUN ON ISO-NE'S ELASTIC COMPUTING PLATFORM granting them the maximum flexibility for customization and Data Time spent on No. No. of cloud Nodes Cost performance/cost optimization. Fig. 10 shows the architecture of Nodes vs. on- Uptime ($) vs. overview of the cloud platform. Study Jobs Used prem Type PC As shown in Fig. 10, ISO-NE separates the public-facing cluster N-1-1 ~ 30 ~ 10 102 ~ 10 ~ 101 ~ web server from the back-end servers using both public and contingency 1h ~ 6h times times 104 20 102 private subnets as recommended in Section IV. A. 2) a) analysis faster faster They adopted a comprehensive security control scheme to FCA delist 103 ~ 100 ~ 100 < 1h* N/A* N/A* protect this platform from being accessed by an unauthorized study 104 200 ~ 30 ~ 10 entity. A multi-layer protection scheme in line with the NPCC BPS 103 ~ 10 ~ 1h ~ 101 ~ times times Test 104 20 12h 102 recommended security control shown in Fig. 4 was applied to faster faster the workload. At present, ISO-NE’s elastic computing platform ~ 40 ~ 15 Demand 102 ~ 101 ~ 5 ~ 20 3h ~ 5h times times hosts three primary power system software applications to curve study 103 102 support both steady-state and dynamic studies: GE MARS, faster faster 12 Tie benefit 50 ~ 5 ~ PowerGEM TARA and Siemens PSS/E. The ISO is also 5 ~ 20 1h ~ 5h times N/A* Study 103 50 evaluating the possibility of moving large-scale Electro- faster Magnetic (EMT) simulations to the cloud. One example of cloud superiority is visible in FCA delist study. Prior to cloud migration, ISO-NE had never managed to run all N-1-1 scenarios in a FCA delist study as the study is a strictly time-constrained job. The delist study allows ISO-NE to evaluate if there are any reliability issues in case of N-1-1 contingencies with the units’ delist bids considered. The ISO must finish the analysis and post the auction results no later than the second day after the auction is closed. With access to unlimited resources on the cloud and the capability to scale out almost instantly, ISO-NE has been able to successfully perform a comprehensive assessment of all possible combinations of contingencies in the study within a tight time window since 2019. Omaha Public Power District (OPPD) -

OPPD began to use HPC with an on-premises compute Fig. 10. Architecture overview of ISO-NE’s elastic computing platform cluster consisting of 448 usable cores across 14 Windows servers (Intel Xeon 8C and 12C) in total. With this cluster, a ISO-NE’s cloud platform shares many similarities with that typical P6 run [39] with 50 cases, which takes about 3 hours per of NYISO. For example, both platforms are versatile and can case, only needed a total of 6 hours to complete (~25x speed accommodate multiple software applications for large-scale This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 22 improvement) with 16 cores allocated to each case. Due to the process. The PaaS solution, named SageMaker, did a lot of success of the local cluster in terms of time reduction, OPPD heavy lifting to provide infrastructure and allows ISO-NE started trials on cloud computing. developers to focus on their own use cases. With this platform, ISO-NE data scientists quickly tried out different ML The work was done on a single-tenant virtual machine in methodologies to develop new STLF algorithms, based on Government Cloud. The Site-to-Site VPN was decision tree, random forest, support vector machine (SVM) established to secure the connection. Azure Active Directory and K-means clustering. These new ML-based approaches was used to manage identity and access in the cloud. In a series complement the current STLF tool that is based on a single of cloud attempts, OPPD explored different request parameters, method. With this approach, the effort needed for server including scaling strategies (both vertical and horizontal), provisioning, dependencies installation, environment number of jobs, and number of cores on a job, to evaluate the configuration was waived and the time spent on data speed and cost benefits that can be brought by the cloud. After preparation, model building, training and validation were these cloud trials, they concluded that cloud infrastructure greatly reduced. The well-trained model artifacts, if needed, improves overall performance of computing by giving them were stored in a special format and deployed in the on-premises more scaling flexibility, higher resource availability and environment to connect with the EMS for security compliance. requiring less maintenance effort. OPPD aims to move to a Fig. 11 shows the workflow of ISO-NE’s cloud-based ML hybrid cloud model in the future by shifting the variable portion model development process. of demand into the public cloud environment while keeping the fixed demand in house to maximize their benefits through cloud while simultaneously taking full advantage of the existing on- premises infrastructure. Such a hybrid setup, as discussed in IV. A. 1) a) , is probably the best way forward for the power system companies that are beginning to migrate to cloud as it optimally balances the portfolio of capital investment and O&M spending. B. Cloud Applications in Grid Operation Now we consider various use cases of cloud technology for grid operation. 1) Load Forecasting

ISO New England (ISO-NE) – Fig. 11. Schematic diagram of ISO-NE's ML model development flow

Load forecasting is a typical power system engineering As a public cloud offering, this ML development problem that heavily uses data-driven approaches for modeling environment has seamless integration with security control and analysis. The complexity of load forecast and its need to services. Following the best practices recommended by AWS, process large historical datasets to build high-fidelity models ISO-NE activates access control, data encryption, user/API requires high computational power as well as state-of-the-art activity logging and MFA to mitigate the potential security risk data-driven approaches. Cloud computing is a perfect resource involved with other ML tasks, even though the data used in the for both these needs. current case, system load and weather data, are not confidential. ISO New England is also borrowing power from cloud computing to expedite the development of an enhanced short- 2) Wide-area Monitoring and Data Sharing term load forecasting (STLF) program based on machine ISO New England (ISO-NE)/New York Power Authority learning (ML) approaches. (NYPA) – ISO-NE’s current STLF tool, as a built-in module on EMS, predicts the future 4-hour system demand every 5 minutes to A collaborative, easy-to-share, wide-area situational assist operators with the real-time operation. The tool employs awareness system was implemented in the cloud through a joint a Similar Day method to search the best match of the forecast effort by ISO-NE, NYPA, Cornell University and Washington horizon in the past 7 days considering many attributes such as State University [40]. weather, day of the week, season, actual system load of the In this system, the synchrophasor data are streamed from historical day, etc. The tool can meet the control room’s basic Phasor Data Concentrators (PDCs) at each system operator to needs for STLF, but it requires operators to report several inputs the cloud-based data relay and then rerouted to various when none of the past 7 days is a good match. It has been applications that are hosted in the cloud, such as Regional Data observed that the forecasted load demands derived from the ill- Repository and Real-time Phasor State Estimator (SE). The matched “similar day” were inaccurate, making it a “garbage in state estimator performs an interconnection-wide, time- garbage out” . synchronized assessment of the voltage magnitudes, phase In order to speed up the application of newer data-driven angles, and transmission line flows. It also provides common methods to improve the accuracy of STLF, ISO-NE adopted a visualization displays for real-time collaboration among the PaaS solution on AWS to streamline the development and test participating grid operators. The grid operators who subscribe This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 23 to this platform can access each other’s PMU data as well as generators verbally in a short interval to balance the system. retrieve the historical data from the central repository. Additionally, operators may also have to deal with unusual The cloud deployment of the wide-area monitoring and data times when the facilities are inaccessible, like COVID-19. sharing platform is called “GridCloud”. Its implementation Since not all essential applications for generation dispatch architecture can be found in [40]. The GridCloud system was allow remote access, how to continue operating the grid during built on AWS, particularly VPC and EC2 service, with the such scenarios is of particular interest to grid operators. To duplicated workloads in different Regions to achieve high fault provide a backup control for area balancing in the event of such tolerance and optimal performance (e.g., reduce the network emergencies, ISO-NE developed a cloud-hosted emergency latency). In this implementation, PMU data were streamed from generation dispatch solution, which is based on serverless the distribution points at ISO-NE and Cornell via encrypted computing. ISO-NE evaluated the method amid COVID-19 secure shell (SSH) tunnels to two redundant Amazon Virtual pandemic and found it to work as expected. Private Clouds (VPCs) in Virginia and Oregon. Each data The solution’s serverless architecture has been illustrated in center had 13 cloud instances, with a total average cost of $2.47 [14]. The entire platform is built upon AWS with its serverless per hour. The cloud instances were managed by Cornell’s open- service – Amazon Lambda as the cornerstone. Building upon source software CloudMake and VSync. Besides the use of Amazon Lambda, the platform also integrates other Amazon encrypted SSH to protect the data in transit, the data at rest was services for data ingestion, storage and visualization. Whenever also encrypted using AES-256 algorithm. The encryption did an abnormal event occurs that leads to area balancing add latency, but it remained within satisfactory range. The data dysfunction, the operators are required to raise an emergency consistency between the PMU data producer and the final alarm as required by the current operating procedure. This application (SE in this case) was verified against each other in alarm is used to trigger a periodic retrieval of PMU both data centers, catering to the strong need of data consistency measurements including tie-line flows, frequencies on key by grid operators. Data consistency ensures that different buses, and active power outputs of the PMU-monitored viewers see the same data, and any updates to grid state are generators, from the PMU database. In the meanwhile, other promptly evident; otherwise, grid operators might take actions data such as unit ramp rates, incremental energy offers, that are detrimental to the grid operation. operating limits and regulation limits will be pulled from the GridCloud demonstrated that the complex smart grid market database (if that database also fails, the last successfully applications with strong requirements on security, consistency, retrieved dataset is used). The data are then securely sent to an and timeliness could be supported on the public cloud. The S3 bucket (an Amazon cloud storage service) and encrypted. system used redundancy and replication to mask the various The data upload operation then triggers a Lambda function to sources of delay and jitter and helped to manage tamper-proof start calculating ACE and DDPs for the dispatchable real-time archival storage of the collected data. Moreover, the generators. The calculation results are sent to another S3 bucket overall cost of the entire system setup, which includes the where a static website is hosted with the data visualized through cryptographic security, was surprisingly low. Data-Driven Document display technology (d3.js) [41]. The 3) Backup Control and Emergency Generation Dispatch results are also written to DynamoDB, a NoSQL database service on AWS, for archiving purposes. ISO New England (ISO-NE) – The BA and the generator operators can visit their respective One of the important functions that a system operator webpages to visualize the dispatch information through HTTPS performs is balancing load and generation in real time. An links in terms of signed Uniform Resource Locators (URLs) operator which assumes such a role is also called Balancing after they authenticate themselves with a pre-defined user pool Authority (BA). Under normal situations, the area balancing in Amazon Cognito. The operators at the BA side are allowed functionality is mainly achieved through Automatic Generation to manually override the advised DDPs calculated by the Control (AGC) and Unit Dispatch System (UDS). AGC is a Lambda function if they think these values are unreasonable, critical component of the EMS. It calculates Area Control Error whereas the generator operators can acknowledge the advised (ACE) based on SCADA measurements of tie-line flows and DDPs or give a reason to decline it. Similarly, generator frequencies, computes and sends AGC set points to the operators can initiate a request to change the Unit Control Mode participating units every 4 seconds. UDS runs periodically at an (UCM) of a unit and BA can confirm this change. These user interval, usually every 10 minutes, with a look-ahead time requests are completed by API calls through Amazon API window to develop a security-constrained economic solution Gateway. All the operation activities are logged in the and the Desired Dispatch Points (DDPs) for DynamoDB database for auditing and responsibility-tracking all dispatchable resources. When an abnormal circumstance purposes as a replacement for phone conversation recording. occurs, e.g., AGC fails to work due to EMS shutdown, UDS out To protect data in transit, HTTPS with Transport Layer of service due to communication network failure, operators Security protocol is enforced. As for protection of data at rest, have to manually dispatch generators over secure phone calls the server-side encryption with AWS KMS managed keys is for area balancing. Unfortunately, such a dispatch procedure used. The data keys rotation is activated through KMS, and the involves significant overhead and manual input, making the master key is also manually rotated periodically for enhanced system operation performance vulnerable to human errors. security. Besides, it is also progressively difficult to dispatch multiple With the aid of this cloud-based solution, the potential This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 24 human errors due to the use of phone conversations for manual concern is utility engineers may not always pay attention to the verbal dispatch can be avoided. Each operation activity done alert notifications in a timely manner. When they start looking through the platform is written to a database in a clearly defined back on the frequency excursion events to analyze governor data format, which makes auditing and responsibility tracking response, it is not easy to categorize these alerts by their much easier. More importantly, the solution also enables the criticality and prioritize them for study. In order to meet operators to monitor and control the grid even when they need NERC’s requirement for BAs to submit their annual report of to work remotely away from the control room as long as they compliance with frequency response performance specified in have access to the Internet connection through an encrypted [45], ISO-NE built a scalable big data analytics platform on SSH tunnel. Thanks to adoption of the event-driven Function- AWS to enable fast and accurate identification of large as-a-Service (FaaS) model, there is no charge at the ISO under frequency deviation events from historical dataset. normal operation conditions. The overall cost of this solution is As shown in Fig. 12, the core service of the big data platform extremely low considering the infrequent occurrence of an is Amazon EMR (Elastic MapReduce), which is used to create emergency. an elastic cluster with as many nodes (EC2 instances) as needed 4) Anomaly Detection based on the projection of data growth. EMR is seamlessly integrated with service to directly access data ISO New England (ISO-NE) – stored in permitted S3 bucket as if it were a file system like Power utilities gather various types of data from day-to-day Hadoop Distributed File System (HDFS) [46]. The EMR cluster operations. As power systems modernize, many utilities/ISOs natively supports most up-to-date data analytics applications, are unable to analyze the sheer amount of new information the including big data analytics engines such as Apache Spark and grid sensor collects. This inability to analyze the massive Hadoop, machine learning platforms like Tensorflow, data quantity of data has been further aggravated as the number of warehouse and interfaces like Apache Hive, and interactive PMU devices grow [42]. The system measurements, especially notebook tools like Jupyter and Zeppelin. the PMU data, contain a substantial amount of valuable information that reflects the system conditions. Mining of these data to identify those observations that deviate from the system’s normal behavior could help power utilities uncover the hidden patterns or issues in the system. For that goal, there is a need to perform analysis on massive datasets with big data analytics methods to obtain useful information about the system. To be of practical value, however; these methods have to be much faster than the traditional data processing tools. One use case which has become ubiquitous in ISOs is the efficient identification of frequency excursions. Frequency excursion is a direct result of an imbalance between the electrical load and generation. To maintain frequency stability, it is necessary to effectively control the frequency within a defined range. The Frequency Response Initiative Report by Fig. 12. ISO-NE's scalable big data analytics platform NERC recommends all turbine-generators to be equipped with governors so as to provide immediate and sustained response to From the security control perspective, the entire platform abnormal frequency excursions. A specific frequency response adopts a Role-based Access Control (RBAC, refers to section performance requirement is also given in this report for power IV. A. 2) b) for details). The users/data analysts plants to comply with: “Governors should, at a minimum, be whoever intend to access the notebook environment are fully responsive to frequency deviations exceeding ±0.036 Hz required to login using MFA. The platform is highly cost- (±36 mHz)” [43]. Balancing Authorities (BAs) or their effective because the cluster is scalable and compute nodes are equivalent entities’ responsibility is to monitor, measure, and launched as Spot Instances. The cluster configuration is a no- improve system total frequency response as well as audit brainer. The developer can easily modify the parameters to size governor response of larger power plants. Currently many BAs the cluster and work with a massive volume of data. There are rely on FNET/GridEye [44] to keep informed of system two ways to acquire the PMU data from ISO-NE, one is through disturbance events. FNET collects data from over 300 batch upload and other through real-time streaming. For the Frequency Disturbance Recorders (FDRs) deployed in the former approach, the historical synchrophasor data in archives North America power grid. The accuracy of the frequency are uploaded to an S3 bucket periodically through either AWS measurement obtained from the FDR can reach ±0.0005 Hz. Command Line Interface or AWS API. For the streaming When a frequency measurement deviates from the system method, Amazon Kinesis is used to ingest the real-time PMU nominal frequency (60 Hz) significantly in a short period, data streams (this part is still under development). Tests show FNET is able to provide real-time alerts via emails. However, that the platform could help the engineers quickly identify the FNET is insensitive to slow frequency changes (i.e., small time of large frequency excursions occurrence when they need df/dt) and might miss the events when the frequency deviation to evaluate if the governor response satisfies NERC’s is significant, yet the rate of frequency change is small. Another This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 25 requirement [42]. accuracy improvement has also lowered the effort of field crews on updating the outage information while they are working hard 5) Operational Efficiency and Customer Experience on restoring the customers’ power supply. Improvement Moreover, PGE also harnesses the heterogeneous data that Portland General Electric (PGE) – are collected within the data lake, such as AMI meter The utilities employ the smart meter data as well as the measurements, photovoltaic (PV) panel capacity and weather customers’ input to identify the problems quickly and data to predict the BTM solar generation for better distributed accurately in their service territory. The comprehensive resource planning (DRP) [48]. This helps properly plan out the analytics helps them make decisions about when to take grid assets to meet the customer’s demand when using cheaper preventive measures after a power pole health inspection or clean energy. estimate the service restoration time based on the statistics of PGE applies a comprehensive cybersecurity audit on the historical records. These contribute to the improvement of the cloud platform and any data used on it. They discriminate data utilities’ operational efficiency and customer experience. PGE based on their confidentiality, classify their accounts is one of the utilities which pioneered using cloud to accomplish (Organization Units, or OU) in terms of purpose of usage, e.g., these goals. DEV, TEST, and PROD, and implement identity-based access PGE moved most on-premises databases that spread across to these accounts. The maintenance effort of this solution can the company to a centralized data lake in the cloud, which be fully automated through the cloud and is claimed to be less enabled PGE to perform advanced analytics on massive amount time intensive. The data management team, which consists of 7 of heterogeneous data and roll out multiple new digital channels employees, was overwhelmed by the on-premises data such as Intelligent Voice Assistant (IVA), website warehouse maintenance work. Now they can maintain both the improvement, mobile app, and proactive outage notification. on-premises and the cloud workload effortlessly. The cloud adoption at PGE is strongly motivated by the goal to C. Cloud Applications for Other Business Needs provide data driven insights for strategic planning and to avoid 1) DER Aggregation and Management monolithic architecture from the solution design perspective. The direct outcome of this change is better customer As a prevailing solution for DER aggregation and unified engagement and higher operational efficiency. management, Virtual Power Plant (VPP) has drawn a lot of PGE developed two platforms for cloud-based analytics: a attention. A VPP is a centralized platform that takes advantage fine-grained API microservices platform for database migration of information and communication technologies (ICTs) of the on-premises customer outage management system to and Internet of things (IoT) devices to aggregate the capacities reduce customer friction and a data lake for performing of heterogeneous DERs to form "a coalition of heterogeneous advanced analytics and machine learning tasks. The overview DERs”. It acts as an intermediary between DERs and the of the database migration solution and the architecture of the wholesale electricity market and trades energy on behalf of data lake can be found in [47]. Through the cloud-hosted DER owners who by themselves are unable to participate in that customer outage management solution, PGE customers are market. VPP is a cloud-native platform, which means it provided an omni-channel experience to log in from any of the utilizes cloud computing to "build and run scalable applications channels, the mobile or web, to perform a function like making in a cloud environment”. Centrica and AutoGrid are two a payment, viewing or report an outage, or to manage their examples whose VPP solutions are built on public cloud products and programs. It is said that the platform helps PGE services. reduce the planned outage time of the customer support service Centrica – by 1/10 and hence increase the customer satisfaction. On the Centrica has been a solution provider for distributed assets other hand, the data lake helps PGE sunset some of their on- monitoring and management for 10 years. Centric’s VPP premises data marts/warehouses and is expected to save more platform has 1.7GW of DERs under management. The product, than 3 million dollars in the next couple of years. It gives PGE named FlexPond, delivers an industry-leading reliability to the ability to isolate and increase compute power independent ancillary service markets across Germany, UK, Belgium and of storage. One highlighted example is sharing of AMI billing France. The solution architecture of FlexPond can be found in data with the distributed energy resources partners, which can [49]. be done in a matter of minutes in comparison to a few days Centrica adopted AWS as the cloud provider and built the when done on-premises systems. Another example is the solution using services such as AWS IoT Core and AWS IoT reduction in time (reduced by 168 hours/year) for customers to Greengrass. IoT Core lets connected devices interact with cloud report their energy usage with The Environmental Protection applications and other devices securely and effortlessly. IoT Agency (EPA) to track their greenhouse gas emissions. Core can support billions of devices and trillions of messages Furthermore, with access to ML models that can be quickly and can process and route those messages to AWS endpoints prototyped and deployed to seamlessly integrate with the data and to other devices reliably and securely. IoT Greengrass lake, the data scientists at PGE are able to improve the accuracy seamlessly extends AWS to devices so they can act locally on of Estimated Outage Restoration Time (EORT) from 24% to the data they generate, while still using the cloud for 59%, thus reducing about 700,000 unnecessary messages sent management, analytics, and durable storage. Through the to the customers for EORT updates at the onset of outages. The public cloud offerings, Centrica was able to shorten the product This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 26 development time, increase the system scalability and simplify network intrusion detection, AutoGrid uses Amazon the DevOps flow. According to Centrica, much lower GuardDuty. maintenance effort is needed in this cloud solution compared to AutoGrid practices cybersecurity from the ground up and the on-premises solution. builds their grid products baked in with cybersecurity Centrica secures all communications, stored data and capabilities and controls. The security challenges mentioned continuously monitors the system integrity. The VPP platform above as well as demonstrating compliance requirements are is not accessible from the public Internet. Any authorized addressed by appropriate security controls implemented by access requires two steps: login to a private network and two- AutoGrid in collaboration with AWS cloud infrastructure. Like factor authentication. All communication between the users and Centrica, these security controls are also verified and attested the platform is encrypted with strong cryptography. The data by an independent third-party auditor. associated with the DERs management are stored in secure, ISO 2) Market Settlement 27001-certified data centers. The access to the data is strictly Australia Energy Market Operator (AEMO) – limited based on the user identity and corresponding access AEMO is Australia’s independent energy markets and rights. The platform uses cloud services to log, monitor and power systems operator, and system planner. One of its retain account activity related to actions on data. Besides, it also principal responsibilities is settlement of the $16 billion-plus employs third party expertise for testing and monitoring of National Energy Market (NEM) which connects the grids of vulnerabilities in the system. eastern and southern Australia states and territories to create a A case study at Terhills in Belgium has shown that wholesale energy market. Retailers and wholesale consumers Centrica’s VPP is able to aggregate a diverse range of pay AEMO for the electricity they use, and AEMO then pays heterogeneous DERs from generation (combined heat and the generators. power), storage (batteries), micro-production (wind and solar), Currently, the National Energy Market is dispatched in five- and flexible loads (residential, industrial and commercial). minute slots – but only settled every 30 minutes due to Centrica launched a 32-megawatt virtual power plant, with a limitations in its ability to segment the data. AEMO has been distribution grid connected to the 18.2-megawatt Tesla working with Microsoft and partner Tata Consultancy Services Powerpack storage system in the Terhills project. The installed to transition from a legacy settlement system to its big data VPP has been providing primary frequency response to the Metering Data Management solution built on Microsoft Azure Belgian TSO since April 2018 and stacking value with cloud. Cosmos DB is the main data store, leveraging Azure additional participation in the real time balancing market. This Kubernetes Services for the application and runtime layers. The battery project is unique because it is included in a larger ability to easily manage huge amounts of data allows settlement flexibility portfolio, which results in a 1.5 times higher revenue in five-minute blocks for AEMO. The switch to five-minute stream for the battery, compared to the base case where the settlement (5MS), scheduled for October 2021, aligns the battery is monetized standalone. dispatch and settlement times and is expected to remove a AutoGrid – number of potential market barriers for renewable energy AutoGrid VPP aggregates customer-owned flexible storage, providers for bidding and dispatching as well as encourage distributed generation and demand-side resources to monetize additional innovation. The solution should provide a stronger resources in multiple energy markets and turn them into cash incentive for market participants to respond to the rapidly generators. With state-of-the-art forecasting and asset- changing dynamics of the electricity market [52]. optimization capabilities, AutoGrid VPP allows utilities and aggregators to create additional flexible capacity and extract 3) Collaborative System Modeling and Hybrid Simulation maximum value from flexible resources in lucrative markets around the globe. Midcontinent ISO (MISO) – AutoGrid also chose AWS as the CSP to build its VPP The trifold role of MISO requires accurate power system solution upon a vast range of services and features to secure the models for operations, market, and planning activities, which workload and demonstrate the compliance with security becomes increasingly challenging as volume of model data standards. The VPP platform [50] is hosted on a secure logical increases and frequency of change accelerates due to network and access to it is limited to ports from known IPs integration of distributed renewable resources and dynamic using security groups and network access control lists. AWS model refinement. IAM is used for fine-grained, API-level control of remote Currently, MISO updates the operation models on a access for administration of cloud assets. The software itself quarterly cycle. However, the members of MISO are expecting runs on a highly scalable and resilient Kubernetes architecture more frequent model updates. MISO has multiple disjointed on Amazon Elastic Kubernetes Service (Amazon EKS) within customer requests for the same model data in operation and the protection of an Amazon VPC. All data in Amazon S3 is planning horizons. The ISO gets model data multiple times in encrypted with keys that are managed by the AWS KMS, which different formats. As a result of which, there is a fair chance implements FIPS [51] compliant modules and uses AWS IAM that discrepancies exist between the online and offline model to control access to keys. For logging and monitoring, the when processes are out of synchronism, creating a barrier for solution uses AWS CloudTrail and Amazon CloudWatch for model management with continuous increases in volume and alarms and notifications that help with incident response. For frequency of change of modeling data. Power system model This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 27 errors inevitably occur due to manually intensive and by cloud infrastructure provides a better outreach and support duplicative processes, resulting in inefficient use of human to the user community without having to access or modify their resources and underutilization of energy resources in the systems. generation fleet. By means of investing in transformative OMF.coop software has been in production for a couple of modeling processes that leverage the state-of-the-art cloud years, most of the maintenance has been automated. While technology, MISO can ensure the power system models are some man-hours are spent per month to verify new code accurate, synchronized and consistently updated across ISO deployments, the overall cost of the cloud-hosted solution is activities, and transparently provided to its members in shorter insignificant. update cycles and with reduced errors. 4) Coordinated System Operation Drill The new solution that MISO is going to adopt for network IncSys/PowerData – model management (NMM), is built on top of AWS public IncSys and PowerData are strategic partners that lead a cloud services. The solution is delivered to MISO from its cooperative effort in developing a software solution to help vendor Siemens via a SaaS subscription, bringing the ISO a power system organizations of all sizes train and prepare system scalable infrastructure, consistent application performance, operators to ensure the reliability of the bulk power system. reduced need for IT support, as well as security assurance from Their training software, PowerSimulator, allows NERC entities both the cloud provider and the vendor. The SaaS model shifts to test their restoration plans with voltage and frequency all IT-related maintenance effort to the vendor, while MISO response. It allows simulation of the most devastating system only needs to grant the use of this system to its members and events, such as wildfires, hurricanes, tornadoes, ice storms, provide necessary training to them. The system has been earthquakes, and gas curtailment due to cold snap. scheduled to go into production in late 2021. PowerSimulator also helps these organizations to test and Although users have concern over data security, the vendor develop the proficiency of their system operators on the tasks has a vested interest in providing the highest level of data they perform under normal and emergency operations. The security. The NMM system has changed logs to support audits representative users of PowerSimulator include ISO New of access to model data whenever necessary. On the other hand, England, FRCC, PJM, WECC, Central Maine Power, Vermont MISO takes security of systems in the cloud seriously. The Electric, NSTAR, National Grid, etc. security requirements that have to be met cover the following PowerSimulator allows users to simulate and train on their areas: Access Control, Audit and Accountability, Awareness own system with a high level of realism. For example, they can and Training, Configuration Management, Contingency start and redispatch generation and see the effect on Planning, Identification and Authentication, Incident Response, equipment/path MW, MVAr and MVA flows, as well as bus Maintenance, Media Protection, Personnel Security, Physical voltages and angles. They can also control voltage with and Environmental Protection, Planning, Risk Assessment, transformer taps, generator kV set points, shunt capacitors, Security Assessment and Authorization, System and shunt reactors and SVCs. Operators can develop, train on and Communication Protection, System and Information Integrity, test switching orders before they are executed. Every type of System and Services Acquisition. bus configuration including main and transfer, breaker and half, National Electric Cooperative Association (NRECA) – double breaker double bus, double bus single breaker, ring bus NRECA is a national non-profit service organization that and single bus single breaker can be modelled in the simulator. serves over 890 rural electric utilities. NRECA offers Open Being a native , PowerSimulator is fast and Modeling Framework (OMF.coop) at no cost to its member runs in the most modern browsers such as Edge, Chrome and utilities via a combination of internal and federal funding. The FireFox. The cloud allows for simulation participation from any OMF.coop is a software development effort led by NRECA site with Internet access and a modern browser. Bringing with a goal of making advanced power systems models usable realism to drills and training as participants access the software in the electric cooperative community. OMF.coop addresses the from their normal place of work using the same communication lack of versatile modeling tools that would enable utilities to tools they would in a real contingency. Three-way evaluate smart grid components using real-world data prior to communication is heavily exercised by multiple NERC entities. purchase. NRECA has adapted the OMF to meet utilities' need All participants work on a real-time interactive session playing for a tool that can combine and analyze data resulting from the their actual roles. The actions of each operator and its impact integration of new renewable resources such as wind and solar, on the system are seen in real-time. as well as other distributed energy resources. This enhanced PowerSimulator is offered to its clients as a SaaS model. It modeling tool can support co-op investment decision-making is normally provided on a public cloud where IaaS services such by modeling the cost and benefits and incorporating as network, compute, storage and security are utilized to build engineering, weather, financial and other data specific to the co- this cloud-based solution. It can be also hosted in a private cloud op. or in a virtualization environment through VMWare and Hyper- OMF.coop solution is hosted in the AWS public cloud and V. However, moving to the public cloud in this case does bring uses a combination of IaaS (Amazon EC2) and PaaS (Amazon several benefits over other platforms according to IncSys and SES, Amazon S3) services. The on-demand features of the PowerData. First, the public cloud allows for quick availability. cloud allow OMF.coop to easily scale the solution based on the Single-tenant virtual machines can be easily and quickly set up user growth. Exposing the software via a web interface backed and brought online to meet growing customer demands, This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 28 allowing end users to test new add-ons, custom features and contributors are Sung Jun Yoon, Uma Venkatachalam, and other necessary upgrades in an independent test environment Aravind Murugesh from PGE, Xing Wang from Centrica, Scott without the need to revert these changes later. Secondly, Public Harden from Microsoft, Michael Welch from New York ISO, cloud is a scalable, reliable and secure infrastructure providing Michael Swan from OPPD, David Pinney from NRECA, David peace of mind over hardware failure during mission critical Duebner from MISO, and Chris Mosier from IncSys. drills. It allows utility IT departments to focus on maintaining their own EMS and SCADA without being loaded up with VIII. REFERENCES training simulator issues. The scalability brought by public cloud allows for more than one hundred users to participate in [1] Cybersecurity & Infrastructure Security Agency, "Critical one drill, creating a sense of realism. Multiple users can Infrastructure Sectors," [Online]. Available: https://www.cisa.gov/critical-infrastructure-sectors. [Accessed 31 3 participate in the roles they would perform in real life. 2021]. Additionally, going to the public cloud eliminates the traveling [2] M. Mingas, "Cloud seen as critical infrastructure by 94%," [Online]. costs of operators who participate in the coordinated drill as Available: https://www.capacitymedia.com/articles/3827693/cloud- training can be done online. seen-as-critical-infrastructure-by-94-. The simulation training workload hosted in the cloud is [3] McKinsey, "Creating value with the cloud," Digital McKinsey, 2018. secured by a multifaceted protection scheme via following the [4] X. Luo, S. Zhang and E. Litvinov, "Practical deisgn and best practices for cloud-level application security. The implementation of cloud computing for power system planning studies," IEEE Transactions on Smart Grid, pp. pp. 2301-2311, 2018. protection includes: a) minimizing public exposure and attack [5] NIST, "The NIST Definition of Cloud Computing," U.S. Department surface: servers respond to Internet requests only for the of Commerce, 2012. [Online]. Available: authorized service, cloud-vendor firewalls and security groups https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800 are used as an additional layer of protection, and access can also -145.pdf. be restricted from specific organizations; b) Penetration testing [6] B. Bhattarai, S. Paudyal, Y. Luo and e. al., "Big data analytics in smart grids: state-of-the-art, challenges, opportunities, and future is performed at regular intervals; c) Agents run on each server directions," IET Smart Grid, vol. 2, no. 2, pp. 141-154, 2019. to continually monitor the system and detect possible [7] D. C. Park, M. A. EI-Sharkawi, R. J. Marks, L. E. Atlas and M. J. intrusions; d) System-level logs are transferred to a service that Damborg, "Electric load forecasting using an artificial neural allows for managed retention and review; e) All CEII data is network," IEEE Trans. Power Syst., vol. 6, no. 2, pp. 442-449, 1991. encrypted while at rest, and while in transit between persistent [8] Z. Z. Zhang, G. S. Hope and O. P. Malik, "Expert systems in electric systems - a bibliographical survey," IEEE Trans. Poewr Syst., vol. 4, stores, server, and end users; f) Requires modern transport layer no. 4, pp. 1355-1362, 1989. security (TLS) for encrypted communication, server [9] N. Sharma, P. Sharma, D. Irwin and P. Shenoy, "Predicting solar identification and client-side authentication; g) Auditing of all generation from weather forecasts using machine learning," in IEEE automated and manual activities affecting the configuration of Intl. Conf. Smart Grid Communications, Brussels, 2011. cloud resources. [10] B. Hooi and e. al., "Gridwatch: Sensor placement and anomaly detection in the electrical grid," in Joint European Conf. Machine Learning and Knowledge Discovery in Databases, Dublin, 2018. VI. CONCLUSION [11] R. Mitchell and I. Chen, "A survey of intrusion detection techniques Cloud computing is a mature technology, which many for cyber-physical systems," ACM Computing Surveys, vol. 46, no. 4, industries are aggressively adopting to meet their business pp. 1-29, 2014. needs. In the power industry, cloud technology is appealing to [12] X. Zhou, S. Wang, R. Diao, D. Bian, J. Duan and D. Shi, "Rethink AI-based Power Grid Control: Diving Into Algorithm Design," an increasing number of practitioners as well. Through cloud- NeurIPS, Online, 2020. based technologies, power system organizations can realize [13] D. Anderson, T. Gkountouvas, M. Meng, K. Birman, A. Bose, C. their digital transformation goals effortlessly. They can also Hauser, E. Litvinov, X. Luo and Q. Zhang, "GridCloud: Infrastructure adapt to the grid modernization process. However, there are for Cloud-Based Wide Area Monitoring of Bulk Electric Power Grids," IEEE Transactions on Smart Grid, vol. 10, no. 2, pp. 2170- several bottlenecks to the use of cloud in power systems due to 2179, 2018. common misconceptions about the technology and concerns [14] S. Zhang, X. Luo and E. Litvinov, "Serverless computing for cloud- about service quality, cost, security, as well as compliance. This based power grid emergency generation dispatch," International paper aims to quell these misconceptions and provides guidance Journal of Electrical Power & Energy Systems, vol. 124, p. 106366, to overcome these bottlenecks. It also puts forth a body of 2021. [15] Dremio, "What is a Cloud Data Lake," [Online]. Available: knowledge to assist regulatory bodies with development of https://www.dremio.com/data-lake/cloud-data-lakes/. [Accessed 2021 standards and guidelines that relate to cloud adoption and 12 04]. compliance attestation. We believe that with the [16] B. Kellison, "The next five years will see massive distributed energy aforementioned challenges addressed appropriately, cloud will resource growth," [Online]. Available: become a widely accepted technology and play a significant https://www.woodmac.com/news/editorial/der-growth-united-states/. [Accessed 20 4 2021]. role in digitalizing the power industry. [17] Energy Networks Australia, "Electricity Network Transformation Roadmap," CSIRO and Energy Network Australia, 2017. VII. ACKNOWLEDGMENT [18] AutoGrid, "Harness DERs to enhance grid operations," [Online]. The authors would like to thank a group of the Task Force Available: https://www.auto-grid.com/products/derms/. [Accessed 20 4 2021]. members for providing the facts about how they use cloud- based solutions to address their specific business needs. These This is a preprint version with comprehensive discussion on use cases. A reduced version has been submitted to IEEE Transactions on Smart Grid 29

[19] EnSync, "Internet of Energy Platform Aggregates and Monetizes [42] S. Zhang, X. Luo, Q. Zhang, X. Fang and E. Litvinov, "Big Data Distributed Energy Resources," [Online]. Available: Analytics Platform and its Application to Frequency Excursion https://www.prnewswire.com/news-releases/ensync-energys-der-flex- Analysis," in IEEE PES General Meeting, Portland, 2018. internet-of-energy-platform-aggregates-and-monetizes-distributed- [43] NERC, "Frequency Response Initiative Report," NERC, 2012. energy-resources-300423096.html. [Accessed 20 4 2021]. [44] UTK, "FNET/GridEye," [Online]. Available: [20] Leap Energy, "Turn your energy resources into revenue," [Online]. http://fnetpublic.utk.edu/. [Accessed 5 4 2021]. Available: https://leap.energy/. [Accessed 20 4 2021]. [45] NERC, "Frequency Response and Frequency Bias Setting, NERC Std. [21] helics.org, "HELICS documentation," [Online]. Available: BAL-003-1 R1," NERC, 2013. https://docs.helics.org/en/latest/. [Accessed 4 5 2021]. [46] Amazon Web Services, "using Amazon S3 as a data store for Apache [22] Amazon Web Services, "Navigating GDPR Compliance on AWS," 12 HBase," [Online]. Available: https://aws.amazon.com/about- 2020. [Online]. Available: aws/whats-new/2016/11/amazon-emr-now-supports-using-amazon- https://d1.awsstatic.com/whitepapers/compliance/GDPR_Compliance s3-as-a-data-store-for-apache-hbase/. [Accessed 6 4 2021]. _on_AWS.pdf. [47] S. J. Yoon, U. Venkatachalam and A. Murugesh, "PGE's Digitial [23] Microsoft Azure, "Shared responsibility in the cloud," 3 2 2021. Journey to Reduce Customer Friction and Improve Operational [Online]. Efficiencies using AWS and Snowflake," [Online]. Available: [24] S&P Global Market Intelligence, "Rate Base: Understanding A https://sites.google.com/view/cloud4powergrid/sharing. [Accessed 23 Frequently Misunderstood Concept," 3 3 2017. [Online]. Available: 6 2021]. https://www.spglobal.com/marketintelligence/en/news- [48] C. D. Feinstein and J. A. Lesser, "Defining Distributed Resource insights/research/rate-base-understanding-a-frequently- Planning," The Energy Journal, vol. 18, no. Special Issue, 1997. misunderstood-concept. [Accessed 31 3 2021]. [49] X. Wang, "Cloud based Virtual Power Plant and Local Energy Market [25] Google, "Taking the Cloud-Native Approach with Microservices," Platforms," [Online]. Available: 2017. [Online]. Available: https://cloud.google.com/files/Cloud- https://sites.google.com/view/cloud4powergrid/sharing. [Accessed 21 native-approach-with-microservices.pdf. [Accessed 31 3 2021]. 6 2021]. [26] NERC, "CIP Standards," [Online]. Available: [50] R. Banerji and N. Dhanasekar, "How AutoGrid Supports Compliance https://www.nerc.com/pa/Stand/Pages/CIPStandards.aspx. Using AWS Cloud Security Services," 04 01 2021. [Online]. [27] AWS, "AWS Well-Architected Framework," [Online]. Available: Available: https://aws.amazon.com/blogs/industries/how-autogrid- https://docs.aws.amazon.com/wellarchitected/latest/framework/welco supports-compliance-using-aws-cloud-security-services/. me.html. [51] NIST, "Security Requirements for Cryptographic Modules," NIST, [28] Azure, "Microsoft Azure Compliance Offerings," [Online]. Available: DoC, Gaithersburg, 2001. https://azure.microsoft.com/en-us/resources/microsoft-azure- [52] Microsoft, "AEMO sparks data driven, AI infused energy compliance-offerings/. transformation," [Online]. Available: https://news.microsoft.com/en- [29] AWS, "AWS Compliance Programs," [Online]. Available: au/features/aemo-sparks-data-driven-ai-infused-energy- https://aws.amazon.com/compliance/programs/. transformation/. [Accessed 4 5 2021]. [30] U.S. General Services Administration, "FedRAMP," [Online]. Available: https://www.fedramp.gov/program-basics/. [Accessed 2021 14 06]. [31] Wikipedia, "IPSec," [Online]. Available: https://en.wikipedia.org/wiki/IPsec. [Accessed 16 6 2021]. [32] AWS, "AWS Cost Optimization," [Online]. Available: https://aws.amazon.com/aws-cost-management/aws-cost- optimization/. [Accessed 21 6 2021]. [33] Azure, "Optimize your Azure costs," 21 6 2021. [Online]. Available: https://azure.microsoft.com/en-us/overview/cost-optimization/. [34] Microsoft Azure, "Data considerations for microservices," 25 2 2019. [Online]. Available: https://docs.microsoft.com/en- us/azure/architecture/microservices/design/data-considerations. [35] D. Linthicum, "You've heard the benefits of containers, now understand the challenges," [Online]. Available: https://techbeacon.com/enterprise-it/youve-heard-benefits-containers- now-understand-challenges. [Accessed 31 3 2021]. [36] IBM, "Considerations in software design for multi-core multiprocessor architectures," [Online]. Available: https://developer.ibm.com/technologies/systems/articles/au-aix- multicore-multiprocessor/. [Accessed 31 3 2021]. [37] Microsoft Azure, "NERC CIP Standards and Cloud Computing," Microsoft, 2019. [38] NIST, "Cybersecurity Framework," [Online]. Available: https://www.nist.gov/cyberframework. [Accessed 23 6 2021]. [39] NERC, "Standard TPL-001-4 — Transmission System Planning Performance Requirements," NERC, Atlanta, 2014. [40] e. D. Anderson, "Cloud-based Data Exchange Infrastructure for Wide Area Monitoring of Bulk Electric Power Grids," in GIGRE meeting, Paris, 2018. [41] d3js.org, "Data-Driven Documents," [Online]. Available: https://d3js.org/. [Accessed 23 6 2021].