Lake Sarez: Interactive Crisis Management on the Highest Dam in the World

Marc F. Muller

Abstract: This paper evaluates the reliability of a natural dam safety system in a crisis situation. The system assessed in this work is the early warning and population evacuation plan linked to the Usoi natural dam on Lake Sarez in Tajikistan. Based on experts’ opinions, both the capacity of the system to issue flash early warning alarms and its capacity to safely evacuate the population are found satisfactory in most crisis scenarios. Yet situations involving the malfunction of key hardware components are found to create crises that may compromise the reliability of the system. Recommendations are issued to favor proper interactive management of these crises by improving the performance of the human components of the system. These recommendations include the requirement of High Reliability Organization standards as well as an emphasis on training and selection. Finally, it is suggested that, for such a crucial matter as flood safety, the emphasis be put on the education, training, local knowledge and judgment of the local population, rather than relying solely on the technology embedded in the system’s hardware.

Subject Headings: Interactive Systems, Dam Safety, Risk Management, System Reliability

I. Background

Human and Organization Factors: Quality & Reliability of Engineered Systems- CE290/2008/1

A. Introduction

This paper applies the principles of interactive approaches to assess and mitigate the risks linked to human and organizational factors in the management of a rapidly evolving crisis. Specifically, the study is conducted in the context of the safety of the highest dam in the world (Palmieri 1999), the Usoi natural landslide dam on Lake Sarez, in the Pamir Mountains of Tajikistan.

Fig. 1. Satellite Image of Lake Sarez, Tajikistan [http://veimages.gsfc.nasa.gov/2388/ISS002-ESC-7771_lrg.jpg]

Although recent studies (Droz 2007) revealed a probability of failure comparable to that accepted for engineered man-made dams, the consequences of a hypothetical failure for the whole region would be so high, owing to its size, that a state of the art early warning system was installed between 2000 and 2006 with international funding. The system is designed to issue several alarm levels upon detection of an early sign of failure and includes a direct evacuation alarm in the downstream villages in case a flood is detected. Although the system is highly automated, human factors remain important components of its reliability, among other things by ensuring proper maintenance and operation of the automated elements, or by ensuring the interactive management of unknown unknowable events in crisis situations. A meta-study of more than 600 engineering failures showed that merely 20% of the accidents were directly due to an intrinsic failure of the engineered system (Bea 2008). The remaining 80% could be linked to a human malfunction. These individual human malfunctions can be traced to several deeper roots, including organizational malfunctions, procedure flaws, dysfunctional hardware or an inadequate working environment. They either directly cause the system failure (extrinsic cause), or create a situation prone to an intrinsic failure. In the case of Lake Sarez, owing to the extreme remoteness of the location, correct maintenance is especially critical to prevent an intrinsic failure of the system. Since some informal concerns have been expressed about decreases in the number of qualified local operating personnel (Droz 2008, Personal communication), this paper intends to evaluate the robustness of the interactive management of the system in the event of a crisis caused, directly or indirectly, by a human malfunction.

B. Interactive Crisis Management

A crisis can be defined as “a developing sequence of events in which the risks associated with the system increases to a hazardous state […] and occur when improbable events are joined and produce an evolutionary and interactive complexity in the performance of a system.” (Bea 2008). Such situations are especially likely to occur in high-risk, high-uncertainty systems such as those intended to mitigate natural hazards. Although such systems are designed to correctly mitigate the targeted natural hazard (in which case no crisis occurs as such), there is a chance that the standard procedures and processes will be disturbed, leading to an unexpected abnormal situation where the system is unable to deliver the outcome for which it was designed. Such a state is a crisis situation.

As mentioned above, the disturbance leading to a crisis is most often a combination of events. These events are generally characterized by high unpredictability (hence a low probability of occurrence) and high potential consequences. They are often fundamentally the result of “human operators ‘pushing the envelope’ and thereby breaching the safety defenses of an otherwise safe system” (Bea 2008). Such human malfunctions are often violations whereby people “do what they should not be doing” (Dougherty 1995, quoted in Bea 2008). Such violations could include the failure to properly accomplish the required maintenance tasks of the safety systems. Another common source of disturbing events can be classified as unknown unknowables and includes the occurrence of unpredictable damaging events, such as the simultaneous occurrence of a third natural hazard.

Such a chain of events creates a crisis situation which, in a natural hazard context, often turns out to be a rapidly evolving crisis, in which time is a critical factor. In the case of Lake Sarez, numerical models expect a massive flood to reach Barchidiv, the uppermost village of the valley, less than 30 minutes after its generating event (Zaninetti 2000, quoted by Schuster 2004). Indeed, rapidly evolving crises are characterized by the time factor and the urge to rescue the system as quickly as possible. Such a situation places considerable stress on the human operators, which often results in the degradation of cognitive performance, eventually leading to vagabonding, retreating and cognitive blackouts in extreme cases (Bea 2008). As a matter of fact, rapidly evolving crises lead to a situation where the cognitive abilities of the human operators are often minimal at the very moment they are most solicited. Furthermore, if the occurrence of the crisis situation itself can be linked to the already poor overall reliability of the human factor (e.g. due to limited manpower, as may be the case here), the result is a dangerous situation highly prone to disaster.

Three complementary approaches are known to mitigate the risks linked to a system (Bea 2008). A proactive approach encompasses the actions taken to prevent the occurrence of a crisis, whereas a reactive approach includes the application of lessons learned from past mishaps and near failures to prevent future crises.

An interactive approach to risk management includes the group of actions taken to restore the system to its operational state, given an occurring crisis. In other words, an interactive approach consists of strategies to turn potential catastrophes into near misses. It is impossible to totally control, through proactive and reactive measures alone, the chain of such events as unknown unknowables and human malfunctions that are prone to lead to a crisis. Fortunately, not all crises lead to failures, and several examples can be given of systems that successfully manage periodically occurring unexpected crises in an interactive way, including medical emergency services, commercial aviation, and natural hazard mitigation. According to Bea (2008), behind each accident are on the order of 10 to 100 near misses. Several authors have studied the common characteristics of these resilient systems (Bea, Weick, Lagadec, Klein, Miller, Pidgeon: all quoted in Bea 2008) and strategies designed to successfully “engineer and manage the unexpected”.

The current risk mitigation strategy for Lake Sarez is proactive by nature. The strategy consists in limiting the consequences of a potential future failure through early detection instrumentation and an efficient evacuation plan on the one hand, while a long-term risk mitigation component is designed to eventually reduce the likelihood of future risks. Yet the interactive aspect of risk mitigation is a key aspect where human factors are of the highest importance. Indeed, given a crisis situation, the ultimate success of its adaptive management almost always relies on human performance. This parameter is often insufficiently taken into account in mitigation strategies, perhaps because it is difficult to quantify and to address (humans are difficult to engineer). Since the proactive component of the Lake Sarez risk mitigation plan is designed and operational, I would like to study the interactive aspects of the mitigation strategy by focusing on its four key sequential components, the so-called OODA loop (Orr 1983, quoted in Bea 2008): observation and crisis detection, orientation and sense making, decision, and action.

Therefore, the system’s ability to successfully manage an occurring natural hazard will be evaluated, in the event of an unexpected disturbance leading to a crisis situation. The system’s operational robustness to the unknown will be assessed.

C. Lake Sarez Case Background

The following section gives a background for the studied case. Although not all of the given information is strictly linked to the management of rapidly evolving crises, broad background knowledge of the institutional and local context is required to understand the bigger picture and properly address the focused topic.

1. Situation

Lake Sarez is located in the semi-autonomous Gorno Badakhshan region, in the southeastern part of the former Soviet Republic of Tajikistan in Central Asia. The dam is located in the Bartang valley in the Pamir mountain range, which is counted among the highest and least accessible mountain ranges in the world. Among the common problems faced by the local population of such remote and mountainous areas is an extreme social, economic and political isolation that is exacerbated by the difficulties arising from the transition from Soviet rule (Schuster 2004). Moreover, the inaccessibility of the region is notorious, as a two-day trip over ill-maintained mountain trails is necessary to link the region to Dushanbe, the country’s capital city. In addition to its isolation and inaccessibility, the area displays extremely high seismic activity, coupled with a harsh continental climate. As a result, the region is a
[…] prone area, where […], […] and stone […] are common.

The lake was formed in 1911 in the aftermath of a 7.6 magnitude earthquake that caused an enormous landslide, which blocked the Murgrab river valley (Schuster 2004). The river waters rose to form a 17 cubic kilometer lake, comparable to half the size of Lake Geneva, flooding the upstream valley over 60 km, with a water level at 3260 meters above sea level. With a height of 550 meters to the lowest point of its crest, Usoi dam is the highest dam in the world. The dam is twice as high as Nurek dam, the tallest man-made dam in the world, also located in Tajikistan.

Fig. 2. Lake Sarez location map [http://www.acig.org/artman/uploads/map_tajikistan.jpg]

Because Usoi is a gigantic non-engineered dam, its safety is not a recent preoccupation. Until recently, the little information that passed beyond the borders of the Soviet Union described a “colossal dam of questionable stability which retained a vast reservoir of water. […] Impact projections suggested that a flood could affect roughly 5 million people living along the Bartang, Panj and Amu Darya rivers, a path traversing Tajikistan, Afghanistan, Uzbekistan and Turkmenistan” (Palmieri 1999) to the Aral Sea, thus reaching large-scale, international proportions.

2. Chronology of events

On February 18th 1911, a gigantic 7.6 magnitude earthquake produced a 2.2 billion cubic meter landslide, burying the village of Usoi and blocking the course of the Murgrab River. The rising water gradually drowned the village of Sarez and formed a 200-meter deep lake. The water level was ultimately stabilized in 1914 by seepage through the dam, with the formation of 57 streams that form a large canyon on the downstream side of the dam.

In the following years, several Russian expeditions were mounted to evaluate the stability of the dam. The information yielded by these expeditions is rare and mostly unpublished (Schuster 2004). The opinions on the stability of the dam were very diverse, but consideration of the high consequences of failure evidently led the Russians to install a first early warning system in 1988 (Palmieri 2000).

This early warning system was based on the combination of hydrometric measurements and visual observations for flood detection. The information was transmitted through cable and satellite connection to Moscow and Dushanbe, where the decision could be taken to evacuate the highly populated lower Amu Darya basin. Yet no plan was designed to alert and evacuate the 19,000 inhabitants of the Bartang valley, who were considered the most at risk (Palmieri 2000). Moreover, at the fall of the Soviet Union in the early 1990s, the technology was aging and the whole system was considered inefficient due to the lack of proper maintenance. In 1991, the country achieved its independence from the Soviet Union. This was followed by a five-year civil war that was particularly intense in the semi-autonomous region of Gorno Badakhshan. As a result, the isolation of the region increased further and the installed early warning system became even more obsolete.

In 1997, the newly formed government of Tajikistan brought the situation to the attention of the United Nations International Decade for Natural Disaster Reduction (UN/IDNDR) secretariat, to “lead an effort to raise international awareness on this problem and develop a program to reduce this threat” (Palmieri 1999). As a result, a UN/IDNDR Interagency Risk Assessment Mission was sent to Lake Sarez to assess the situation. The mission confirmed the inefficiency of the existing early warning system and acknowledged the problems caused by the very low accessibility of the region, which prevented the implementation of a heavily engineered structural solution to stabilize the dam. The mission also stated that the probability of occurrence of a major disaster was very low. Yet, due to the high consequences a failure would have, it recommended the design of an up-to-date early warning system that would enable the safe evacuation of the nearby population most at risk.

In 2000, the Lake Sarez Risk Mitigation Plan (LSRMP) was approved by the World Bank, with the expressed objective of decreasing the “proportion of vulnerable communities in the Bartang and Murgrab valleys with disaster management plans as well as responsibilities and procedures agreed upon by community leaders and villagers, responsible government authorities and interested non governmental organizations (NGO)” (Palmieri 2000). The four-component plan includes technical consulting and the installation of an up-to-date monitoring and early warning system (component A), social training and safety-related supplies for the local population (component B), the study of a long-term solution through intensive monitoring and consulting (component C), and institutional strengthening and capacity building of the local government (component D). The Swiss Government, the United States Agency for International Development (USAID), the Government of Tajikistan, the Aga Khan Development Network (AKDN) and a World Bank credit shared the financial burden of the implementation.

The expected five-year implementation period was extended by one year. In 2006, the LSRMP was fully implemented. In 2007, the final reports on the implementation and the project evaluation were completed (Stucky Ltd, World Bank). The mandate of the implementing agencies has expired, and responsibility for the operation and maintenance of the early warning and monitoring system has now passed to the Tajik authorities. Namely, the Usoi Department, part of the Ministry of Emergencies and Civil Defense in Dushanbe, is formally in charge of the LSRMP.

3. Lake Sarez Risk Mitigation Plan

3.1 Alternatives

One could argue that installing a monitoring and early warning system to mitigate a risk linked to a natural structure that is expected to fail someday, in the near or distant future (Papyrin 2007), is a rather timid and non-durable strategy at the very least. Indeed, several more structural proactive measures were considered by Soviet scientists to mitigate the risk in a more ‘durable’ manner. Such strategies include “controlled 100-150 m water level drawdown in the lake to eliminate overtopping by high wave through construction of a tunnel spillway on the left bank for irrigation in dry years and power generation” or to “raise the crest of the lowest part of the dam by moving the boulder material over the obstruction using construction machinery or by the blast fill method from the exposed scarps located above” (Zolotarev 1986, quoted in Schuster 2004). Yet these most ‘rational’ solutions are difficult to implement due to the extreme remoteness of the area and its difficulty of access (Fig. 3). Indeed, Periotto (2000, quoted in Palmieri 2000) estimates the construction cost of the road required to transport the infrastructure needed for such a project at over $300,000 per kilometer, which heavily compromises its economic feasibility. However, component C of the implemented risk mitigation strategy includes monitoring and documentation toward a possible durable long-term solution that would be economically feasible.

Fig. 3. Typical bridge in the Bartang Valley [27]

Another efficient prevention strategy would consist in evacuating the local population as a preventive measure and labeling the area a “non-habitable zone due to natural hazards” to discourage further resettlement. By humbly acknowledging the power of nature, such an approach would offer a safe and durable solution. Yet the social and political cost of displacing 19,000 rural people who are attached to their land is immense and difficult to pay. Indeed, the sense of belonging to their homeland is very strong among the rural population, and people prefer to stay in their mountain valleys, despite the remoteness, the lack of supplies and the occurrence of natural hazards (Palmieri 1999). The local population has lived with the constant threat of natural hazards for generations and would certainly not envisage the prospect of leaving.

Therefore, a monitoring and early warning system can be seen as a consensus on an economically and socially feasible short to medium term risk mitigation strategy. Indeed, the local population is deeply attached to its lands, displays a remarkably high level of education and is ready to accept capacity building for community participation in disaster mitigation and response (Palmieri 2000). Thus, given the remoteness and inaccessibility of the setting, and given the alertness, understanding and willingness to respond of the local population, the installation of an early warning and monitoring system was selected as the most suitable mitigation strategy for the area (Palmieri 2000).

3.2 Components

The Lake Sarez Risk Mitigation Plan is formed of four complementary components. Although this study focuses on component A (Early Warning and Monitoring System), which is described in detail in a following section, the three other components are described here to give a better understanding of the setting of the project.

Component B consists of social training and the supply of safety-related materials to the local population. It has been implemented by FOCUS Humanitarian Assistance, a non-governmental organization (NGO) with extensive work experience in the region in the fields of natural hazard relief and mitigation (Palmieri 2000), and funded by USAID and AKDN. The objectives of this component were to “make the early warning system community-based” (World Bank 2007) by raising awareness. This was done by providing information and emergency training and by involving the communities in the preparation and supply of safe havens on higher ground. Despite implementation delays, the World Bank evaluated the implementation of this component as “satisfactory” (El-Hanbali 2007). All the vulnerable communities have been identified, equipped with safe havens and organized into response groups that received disaster mitigation training. Despite some initial concerns about the competency of the implementing NGO to maintain high-quality preparedness (World Bank mid term review 2003), given the current local awareness and involvement of the population and the long history of engagement of the local NGOs, the sustainability of component B is not called into question by the World Bank’s final evaluation (El-Hanbali 2007).

Component C consists of the study of long-term solutions to mitigate the risk linked to Lake Sarez. The study is based on the data provided by the monitoring system of component A. Like component A, component C was financed by the Government of Switzerland and implemented by Stucky Consulting Engineers Ltd (STUCKY), a Swiss company, under the guidance of an international Panel of Experts (POE). It consists of complementary desk studies including the assessment of different routes to the lake, digital terrain modeling, inundation studies, wave generation and propagation in the lake, mechanism of dam overtopping, sediment accumulation rate, seepage, etc. (El-Hanbali 2007). The World Bank has evaluated this component as “satisfactory”. Although no further field research has been scheduled for the near future, local experts and decision makers have been provided with complete and up-to-date data on the situation, toward the design of a sustainable long-term solution (El-Hanbali 2007).

Because it is intended to strengthen local institutions so that they can efficiently take over the management, operation and maintenance of the Lake Sarez Risk Mitigation Plan, component D is of particular relevance in the evaluation of the reliability of the human component of the project. Component D has been implemented by the Government of Tajikistan (GoT) with funding from both a World Bank credit and the GoT. A new government agency, the Sarez Agency (SA), was created within the Ministry of Emergency Situation and Civil Defense (MESCD), with the mandate of managing the operation and maintenance phase of the LSRMP. Consultancy and funding from the World Bank were provided to strengthen this institution in its capability to conduct this task. Yet by February 2002, due to unsatisfactory financial management, the SA was dismissed and replaced by the Usoi Department, an existing department within the MESCD, directed by Mr. Kadam Maksaev, which operated the original early warning system. Ultimately, the capacity building and training were thus directed toward the Usoi Department, which is currently in charge of the operation and maintenance of the early detection and monitoring system. The performance and capacity of the Usoi Department were evaluated as satisfactory by the World Bank at the end of the project implementation phase. Yet, although the department has shown a capability to mobilize and coordinate operations from other government and research institutions, this cooperation has not been institutionalized to ensure that it occurs on a regular and formalized basis (El-Hanbali 2007). Moreover, a recent decrease in the department’s qualified staff has raised some informal concerns about its future performance (Droz 2008, Personal communications).

3.3 Monitoring and Early Warning System

The implementation of a monitoring and early warning system constituted the first component (A) of the LSRMP. The system was designed by STUCKY and funded by the Swiss government. Fela Planungs AG, a Swiss construction company, was awarded the supply and installation of the system (FELA Planungs AG 2004). The relevance, in the context of Lake Sarez, of such a “light” proactive mitigation system with such a light structural component, as opposed to a heavier and perhaps more durable engineering solution, has been discussed in section 3.1. The purpose here is to accurately describe the key components of the system, in order to allow a further analysis and evaluation of the system’s operational robustness to the unknown. Where not specified, all the information in the following section comes from STUCKY documentation.

a. Risks assessment

As part of components A and C, a thorough and quantitative analysis of the risks linked to the occurrence of natural hazards on Lake Sarez has been conducted by STUCKY.

Fault tree analysis

A fault tree analysis was conducted, whereby the possible causes of a considered threatening event are assessed and analyzed. In the case of Lake Sarez, the considered threatening event is a condensed flash flood with a flow increase greater than 400 cubic meters per second, as basic estimates consider events above this magnitude to cause significant damage to the downstream villages. The fault tree is displayed in the following figure.

[Figure: fault tree for the top event “flood > 400 m3/s”. Node labels: earthquake, landslide, pressure wave, huge wave, overtopping? (yes/no), external erosion, superficial slide, global instability, internal erosion, internal disturbance, water level variation, clogging, extreme flooding.]

Fig. 4. Fault tree Analysis [6]
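
The way a fault tree combines its branches into a top-event probability can be illustrated with a minimal sketch. The tree structure, the gate types and all probabilities below are hypothetical and chosen for illustration only; they are not the structure of Fig. 4 nor values from the STUCKY analysis. The sketch also assumes that the basic events are statistically independent, and the names `BasicEvent`, `Gate` and `event_probability` are introduced here for convenience.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class BasicEvent:
    """Leaf of the tree: an elementary event with an assumed probability."""
    name: str
    probability: float


@dataclass
class Gate:
    """Intermediate or top event combining its children with OR or AND logic."""
    name: str
    kind: str  # "OR" or "AND"
    children: List[Union["Gate", BasicEvent]] = field(default_factory=list)


def event_probability(node):
    """Probability of a node, assuming all basic events are independent."""
    if isinstance(node, BasicEvent):
        return node.probability
    child_probs = [event_probability(child) for child in node.children]
    if node.kind == "AND":
        # All children must occur: multiply the probabilities.
        result = 1.0
        for p in child_probs:
            result *= p
        return result
    if node.kind == "OR":
        # At least one child occurs: complement of "none occurs".
        none_occurs = 1.0
        for p in child_probs:
            none_occurs *= 1.0 - p
        return 1.0 - none_occurs
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical tree and hypothetical numbers (NOT taken from the study).
failure_mechanism = Gate("failure mechanism", "OR", [
    BasicEvent("piping after toe instability", 1e-3),
    BasicEvent("overtopping wave after landslide", 1e-3),
])
top_event = Gate("flood > 400 m3/s", "AND", [
    BasicEvent("major earthquake", 1e-2),
    failure_mechanism,
])

top_probability = event_probability(top_event)
print(f"{top_event.name}: {top_probability:.3e}")
```

An OR gate is evaluated through the complement of "no child occurs" rather than by summing probabilities, so the result stays valid even when the branch probabilities are not small.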

The analysis of the several hazards revealed by the The risk of the occurrence of a huge wave can be fault tree is summarized in the following paragraphs, explained by the presence of a massive slowly moving and allowed the emphasis of most probable scenarios. slope instability on the mountain side located on the right bank of the lake. A massive landslide into the Lake level increase lake can thus be triggered by an event such as an A steady lake level increase of about 20cm/yr (Droz earthquake, possibly resulting to a that would 2006) has been noticed. Yet a period of 200 years overtop the dam. The height of the wave is a function would be necessary for the lake level to reach the of the speed and volume of the hypothetical landslide. lowest part of the crest of the dam. Moreover, the A numerical simulation of a worst case scenario given higher permeability of the upper layers of the dam is the parameters as known today yields to a wave expected to allow a higher seepage flow rate and thus overtopping the dam by 50 meters above the lowest slow the level increase. A dam overflow due to the point of its crest, ensuing to a flood of about 800,000 steady increase of lake level is thus considered a very cubic meters (Droz 2006). But the occurrence of a unlikely event (Droz 2006). landslide of 0.5 cubic kilometer of volume at 20 m/s that would be required to overtop de dam is Global Dam Instability considered as very improbable. The stability of the natural dam has also been However, a more thorough characterization of the considered as a possible hazard source. right bank landslide remains to be done. Until the The external erosion due to the formation of springs parameters of the possible landslide are not better on the downstream side of the dam has not been known, the “real possibility of a wave overtopping the considered important enough to compromise the dam and the knowledge of the effects of such a wave global stability of the dam. 
Yet, obstructions of the on the downstream valley will remain unclear” (Droz Murgrab River due to limited external erosion are 2007).. considered possible (Droz 2007) and may lead to sudden floods. Expert Assessment Meeting The global stability of the dam in the event of an The discussion conducted above has lead to the fact earthquake has also been considered, especially given that a major threatening event such as a flood superior the high seismic activity of the region. Yet, the to 400 cubic meters per second is highly improbable in stability of the slopes is considered as high, as the the absence of a major triggering event such as an expected displacement attributed to high accelerations earthquake. Therefore, the expert assessment meeting, (> 0.4 g) is of the order of 10 cm and considered as which has been conducted to evaluate and quantify the negligible, given the size of the dam. probability of occurrence of the considered flood, is As a result, although the global stability of the dam is based on the probability of occurrence of a major considered as high, sudden flash floods could be earthquake. The several considered chains of event caused by obstructions of the river due to external leading from a major earthquake to the flood and their erosion. asserted probability of occurrence are displayed in the hazard analysis tree given in annex (Droz 2007). Seepage The two most likely scenarios are (a) a piping failure The seepage of the dam has been closely observed as a as a consequence of toe instability of the dam due to potential source of hazard. Yet, the current estimated the pressure wave caused by the earthquake; and (b) low hydraulic gradient and low speed of the seepage, an overtopping wave caused by large landslide into the as well the low turbidity of the outflow water reservoir, triggered by the earthquake. 
The probability (Schuster 2004) were considered to indicate a low risk of occurrence of both chains of event are in the order linked to the seepage in the current condition of the of 10^-5, which is the probability of failure ordinarily dam. Yet, slow modifications of the seepage regime admitted for man-made dams (Droz 2007). Yet, both are noted and attributed to the geological immaturity of these phenomenon have happened on Italian dams of the dam that will require further monitoring. in the twentieth century, resulting in enormous Clogging hazard is considered as low, due to the high consequences in terms of human lives lost: the tailing number of springs and the heterogeneity of the dam dams of the Stava valley failed due to piping (National material. Piping hazard in the current situation is also Geographic Channel 2008), while the Vajont dam was considered as low, yet an earthquake, or the impact of overtopped by a surge wave (Genevois 2005). a surge wave can cause sudden modifications of the dam’s internal structure. Such a modification could Risk Analysis allow the formation of natural pipes within the dam, However, despite its relatively low probability of which would result in a sudden increase in the occurrence, a flood resulting from a failure of Usoi discharge flow rate and eventual flooding. dam would have high consequences in terms of Thus, although no alarming risk linked to seepage is to damages on the downstream population. Indeed, the expect in the current situation, the evolution of the 19000 inhabitants of the Bartang valley are considered discharge flow rate must be monitored with special as directly exposed, while a total of 132’000 attention in the event of an earthquake or a surge inhabitants of the upper Panj valley are liable to be wave. affected (Palmieri 2000). Between 1000 and 10000 probable casualties are estimated (Droz 2007). Indeed, Overtopping surge wave the existing obsolete warning system would take an

Human & Organization Factors: Quality & Reliability of Engineered Systems-Term Project CE290/2008/6

approximated 54 minutes via television satellite to reach Barchidiv, the uppermost village of the valley, and another estimated two hours to evacuate the residents to higher ground (Schuster 2004). Given the fact that a flood triggered by a landslide-caused surge wave (scenario b) would take approximately 25 to 30 minutes to reach Barchidiv (Zaninetti 2000, quoted by Schuster 2004), the consequences of such an event in terms of casualties are sadly obvious.

Therefore, despite the low probability of occurrence of a threatening event, the high consequences in terms of casualties cause the risk to be high. The following figure represents the risk level as compared to the risk currently accepted by various dam owners.

Fig. 5. Risk Diagram [6]

Therefore, the objective of the updated early warning system is to reduce the risk by reducing the consequences of a threatening event. Preliminary studies by the World Bank (2000) showed that an estimated 18’500 casualties would be avoided. The cost analysis conducted by the same institution indicated an estimated cost per statistical life saved of 232 US$. The justification for investment would thus be “very strong”, by U.S. Federal Government standard practice in Financing Life-Protective-Program (Palmieri 2000).

b. System design

Requirements
Following the recommendation issued by the UN/IDNDR mission and in the framework of the LSRMP, an early warning and monitoring system was designed by STUCKY. The high degree of automation of the system was required because of the data volume to be collected, the inaccessibility of the sensor locations and the necessity to automatically trigger alarms for civil protection purposes once pre-established values of significant parameters are detected (Palmieri 2000). Moreover, an emphasis on high quality data transmission systems was required due to the obvious time constraint in the event of a flash flood, and the high degree of coordination needed between the central managing unit in Dushanbe, the observation station at the dam, the remote measuring units and the village alarms (Palmieri 1999).

Hardware and measuring device
The early warning and monitoring system includes sensors placed in nine distinct monitoring units (MU), scattered on identified sensitive points in the whole area. The instruments include flash flood sensors downstream of the dam and in the village of Barchidiv, seismic accelerometers on the dam toe and the lake shore, global positioning (GPS) measuring devices scattered on the dam and the right bank unstable slope, pressure cells for lake level and wave height in the vicinity of the dam, river gauging stations in the village of Barchidiv and automatic weather stations. Moreover, an observation center (dam house, CU) is located on high ground on the left bank near the dam, with visual contact on the right bank landslide.

The dam house is to serve as a base for periodical observation expeditions, in particular GPS campaigns and turbidity measurements, as well as a relay in the transmission of the monitored data to Dushanbe, where the Supervisory Control And Data System (SCADA) is located. There, the operational control and management of the system is conducted by the Usoi Department. Finally, the SCADA as well as both the dam house and the village of Barchidiv are equipped with a direct manual alarm trigger. The general layout of the early detection and warning system is displayed in figure 6.

Data transmission
As discussed above, an efficient data transmission system is a requirement for the efficiency of the warning system.

For transmission between the monitoring units and the central unit at the dam house, a bidirectional very high frequency (VHF) radio system is generally applied, with the exception of the two most remote units, which are linked to the central unit via satellite technology. There, Very Small Aperture Terminal (VSAT) systems are applied using the International Maritime Satellite Organization (Inmarsat D+) network. Bidirectional transmissions between the central unit at the dam house and the SCADA at Dushanbe are provided by VSAT technology using the International Telecommunications Satellite Consortium (Intelsat IS-704) network.

Finally, the link between the SCADA, the dam house, the alarms in the villages and the rescue unit in the downstream village of Khorog is provided by the Inmarsat D+ network, allowing the priority transmission of alarm signals. Furthermore, the link is bidirectional to allow the monitoring of the alarm equipment in the villages, so that a possible breakdown is detected quickly.


Fig. 6. Early Warning and Monitoring System situation [7]

Fig. 7. Warning levels chart [7]

Procedures
The monitoring devices, according to the value measured for the considered parameters, can issue five warning levels. The different warning levels are described in the preceding table.

Once the warning is issued, several procedures are prescribed for the SCADA to follow, depending on the warning level and parameter. Following these procedures, several alarm levels could be issued, including alarm level 1 (observation) and alarm level 3 (immediate evacuation). The complete procedure list is given in annex. Yet several features can be noticed.

There are three critical parameters suitable to activate a direct automated evacuation alarm in the villages of

the most exposed zone. These parameters are the detection of a flash flood, the detection of an important and persistent decrease of the discharge flow rate, or the detection of a surge wave higher than 50m. In these cases, no direct human interaction is involved in the alarm process, with the exception of a proper maintenance of the device.

In the event of a massive earthquake, no direct interactive measure is prescribed. However, a proactive increase of attention to the evolution of the critical parameters mentioned above (level 1 alarm) is ordered.

Level one warnings on the critical parameters mentioned above are “correlated”, as an increased observation of the other critical parameters is expected.

Other monitoring parameters, considered as less critical (that is, less subject to a short-term time constraint), are also considered. An unusual modification of these parameters also issues a level one alarm, resulting in an increased awareness and more frequent and directed measurements.

Furthermore, as mentioned above, all data is transmitted to the central unit by automated transmissions. There, the data is transmitted to the SCADA at Dushanbe, where it is analyzed and where the procedures are applied. However, level three warnings issued for flash floods and large waves result in the direct transmission of a level three alarm (evacuation) in the most vulnerable villages (Fig. 8).

Fig. 8. Equipped alarm house in a Bartang Valley village [27]

3.4 Current situation and concerns
The monitoring and early detection system is at this date fully operational and under local Tajik steering and management. As mentioned earlier, the Usoi Department (UD), as part of the Ministry of Emergency and Civil Defense, is the implementing agency. Capacity has been built within the implementing agency as part of component D of the LSRMP. Specifically, technical, managerial and organizational training has been given to the staff. As well, operation and maintenance of the system are the responsibility of the UD. It must be noted that the advanced technology of the system, coupled with the harsh Pamiri environment, makes maintenance especially critical to the sustainability of an efficient system (El-Hanbali 2006). The maintenance instructions provided by STUCKY are given in annex. One can notice the importance of maintaining the energy provision source for the monitoring units, especially the flood sensor unit, whose constant functioning is absolutely critical to the efficiency of the system.

Although the global performance of the delivered project has been judged “satisfactory” by the World Bank (2006), the financing institution has raised several concerns on the sustainability of the efficiency of the project.

The first two concerns are rather institutional and financial. First, the teamwork capability among the local agencies is questioned. Indeed, although the implementing agency (the Usoi Department) seems to have shown a capability to mobilize and coordinate cooperation from other government and research institutions, no formalized scheme exists, and some concerns have been raised on the long term sustainability of such informal collaboration. The second concern is raised on the possible lack of sustained long term funding for operation and maintenance, if international funding were to stop.

The two last concerns are much more likely to directly compromise the quality and efficiency of the early warning and monitoring system in a much shorter time and are, in my opinion, more preoccupying. First, a gradual decrease in the quality of the maintenance of the system is suspected (El-Hanbali 2006). Indeed, although training has been given to produce “Operation and Maintenance” and “Control Yearbook” reports yearly, a slight decrease in the quality of the publication has been lately noticed by experts (Droz 2008, personal communication). Furthermore, the World Bank fears a seasonal decrease in maintenance quality due to the difficulty of access and the harsh environmental conditions. However, as a threatening event can occur all year long, critical maintenance tasks including periodic battery and solar panel checks must be regularly done. A pertinent illustration of the significance of the task can be found in the maintenance of MU7. MU7 is the measuring unit upstream from Barchidiv that is designed to detect flash floods and issue level three evacuation warnings to 15 villages (FELA planungs AG 2004). It is thus a critical component of the system. Yet its control unit, GPS and solar panel are located on a remote location about 100 meters above the river (Fig. 9). Their access for maintenance is not a trivial task, especially in winter.


The decrease of the maintenance quality is a critical issue that could produce a crisis situation and will thus be taken seriously in this work.

Finally, the lack of long-term national technical capacity to operate and maintain the system, as well as the lack of ability to recruit and maintain qualified staff during and after the project, cause concerns (El-Hanbali 2006). Indeed, an out-migration phenomenon of local skilled staff has been lately noticed through a significant decrease in the number of personnel having received a proper technical and organizational training who are still working within the Usoi Department (Droz 2008, Personal communication).

As a summary, the probable decrease of the quality of maintenance of the system produces a crisis prone situation, whose interactive management efficiency may be compromised by the lack of qualified manpower. This paper intends to evaluate and give recommendations to manage this uncomfortable situation.

Fig. 9. Remoteness of MU7 control unit [27]

D. Scope and methodology

1. Goal
Given the concerns mentioned in the last paragraph, the goal of the project is to assess and evaluate the reliability of the early warning and population evacuation systems of Lake Sarez, in the setting of a rapidly evolving crisis situation. The focus is set on the human factor in the interactive management of these crises.

2. Methodology

2.1 System definition
The system considered in this paper is the set of procedures, hardware, organization and human operators that are taking part in the management and transmission of information. The considered time window stretches between the occurrence of the triggering natural hazard and the completed evacuation of the villagers to safe havens. I therefore chose not to include the management of eventual rescue missions and the subsistence of the population in the safe havens.

The quality attributes of the system that are here considered are serviceability and safety, whereby a success is recorded when the villagers could be safely evacuated with sufficient time notice before the flood reaches them.

2.2 Demand and capacity components of the safety system
The assessment of the reliability of the system described above will be based on the system’s capacity to meet the demand in the context of a rapidly evolving crisis.

The demand on the system is imposed by the natural hazard scenario that has been attributed most risk by the expert risk assessment of Lake Sarez. This scenario will be detailed in a further section, but leads to a high intensity <2000m3/s flash flood of concentrated mud in the Bartang valley. Quantitatively, the demand on the system is defined as the time available to evacuate the population, counted from the instant the flood could first be detected by the uppermost flood sensor of the system.

The capacity of the LSRMP that would be solicited in the occurrence of the demand described above can be separated into two distinct components: first the time needed to detect and signal the threat in the form of an alarm issued by the engineered system; then the time needed to actually evacuate the local population to safe zones. Thus the capacity meets the demand when the sum of the alarm transmission and evacuation times is smaller than the flood travel time. In that light, the reliability of the system in a crisis situation will be analyzed in this paper.

2.3. Crises
A crisis is defined as a situation where “improbable events are joined and produce an evolutionary and interactive complexity in the performance of a system” (Bea 2008). In other words, I will define a rapidly evolving crisis as a situation where unexpected events force the system to function beyond the setting it was designed for, under critical time constraints, and therefore compromise its ability to perform with the required quality. Therefore, the mere occurrence of the big scale natural hazard considered in the demand analysis that will be conducted does not create a crisis situation per se, because the system is designed to manage such a situation. The elements actually triggering a crisis are additional concomitant events that compromise the ability of the system’s capacity to meet the demand. Three principal types of such events can be mentioned:

- Unexpected “unexpectable” external events attacking the system, typical examples of which include meteorological surprises.
- Internal malfunctions within the system, which often turn out to be linked to human violations. Such events include ill-performed maintenance operations.


- Finally, instead of abruptly decreasing its capacity, events may create a crisis by increasing the demand on the system. In the considered case, such an event would involve an increase in the flood speed.

This paper is focused on human factors and therefore considers crisis-triggering events from the first two categories.

2.4. Evaluation Sources and Validation
The sources for the analysis presented in this paper can be classified in two categories: expert assessment and literature review. The redundancy of the information between and among these two categories constitutes the validation system of the analysis. The principal experts that were contacted are mentioned in the following list:

- Mr. Patrice Droz is the technical director of STUCKY, Ltd, Switzerland. He was in charge of designing and implementing components A and C of the LSRMP. Specifically, he has designed the organization and procedures of the early warning and monitoring system and is thus knowledgeable on its expected behavior. Mr. Droz was motivated to conduct the analysis.
- Mr. Kadam Maksaev is the head of the Usoi Department, the implementing agency. He is currently in charge of the management of the LSRMP. Specifically, he is responsible for the operation and maintenance of the early warning and monitoring system and manages the SCADA, and is thus knowledgeable on its current behavior. Although repeatedly contacted, Mr. Maskaev did not respond positively to the analysis.
- FOCUS humanitarian is the Non Governmental Organization that is in charge of the implementation and management of the evacuation plan of the Bartang valley villages. They are thus knowledgeable on the local context and the current behavior of the local population. Specifically, Mr. Mustafa Karim, the country director, was motivated to collaborate and provided contact with Mr. Abdulhamid Gayosov for flood routing information and Mr. Rahim Balsara for information on local populations.
- Dr Robert Bea is an expert in risk management at the University of California at Berkeley and is thus knowledgeable on the methodologies to be employed in the analysis and evaluation.

The demand is assessed by collecting flood routing data from a combination of reports from the industry (Stucky Ltd 2007, United Nation Mission to Lake Sarez 2001, USAid 2006), of academic literature (Schuster 2004), and of personal communication with knowledgeable experts (Droz, Karim and Gayosov 2008).

Both components of the capacity were assessed by generating crisis situations on the basis of concerns found in the literature (El Hanbali 2007), personal communication (Droz 2008) and in relevant case studies (Bea 2008). Each situation was then matched to the specific context of Lake Sarez by being validated or rejected by the experts mentioned above. The experts were then asked to assess the probable behavior of the system in the event of the crisis situation that they considered most risky in terms of likelihood and consequences. Subsequently, one or several possible interactive crisis management strategies were generated and submitted to the experts’ validation.
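
The success criterion stated in section 2.2 (the capacity meets the demand when the alarm transmission time plus the evacuation time is smaller than the flood travel time) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names are hypothetical, the first example reuses the figures quoted earlier for the obsolete warning system (54 minutes to transmit the alarm, about two hours to evacuate, 25 minutes for the flood to reach Barchidiv), and the second example uses made-up values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvacuationScenario:
    """Hypothetical container for the three times of section 2.2 (minutes)."""

    alarm_min: float         # time to detect the threat and issue the alarm
    evacuation_min: float    # time to move the population to safe zones
    flood_travel_min: float  # time available before the flood arrives

    @property
    def margin_min(self) -> float:
        """Time left over once alarm and evacuation are complete."""
        return self.flood_travel_min - (self.alarm_min + self.evacuation_min)

    def capacity_meets_demand(self) -> bool:
        """True when alarm + evacuation time is strictly smaller than the flood travel time."""
        return self.margin_min > 0


# Figures quoted in the text for the obsolete warning system.
old_system = EvacuationScenario(alarm_min=54, evacuation_min=120, flood_travel_min=25)

# Purely illustrative values, not taken from the paper.
illustrative = EvacuationScenario(alarm_min=2, evacuation_min=9, flood_travel_min=12)

for name, scenario in (("old system", old_system), ("illustrative", illustrative)):
    print(name, scenario.margin_min, scenario.capacity_meets_demand())
```

A negative margin is the number of minutes by which the evacuation would finish after the flood arrives, which makes the shortfall of the obsolete system explicit.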

Fig. 10. MU7 Flood sensor situation [7] (map labels: Dam, MU7, Barchidiv)


II. Analysis

A. Demand

1. Baseline scenario
As discussed above, a crisis only occurs when unexpected events compromise the ability of the system to mitigate a given natural hazard with an expected quality. Thus, in order to define a crisis situation, a non-crisis situation must be considered. There is therefore a need to consider a baseline natural hazard scenario that ought to be correctly managed by the system if there were no crisis. In this study, we will consider the natural hazard scenario that is attributed the highest probability of occurrence by the expert assessment meeting mentioned in an earlier paragraph.

A massive earthquake of 0.5g acceleration (magnitude 7.3 on the Richter scale) occurs. The ensuing gravity wave causes a structural disturbance at the dam’s toe, modifies significantly its internal structure and disturbs its seepage regime. The disturbance is such that as the situation evolves, a piping phenomenon occurs within the dam, whereby the pressing water forms important underground canals. Eventually, after a period of about two hours during which a strong discharge drop can be observed downstream, these canals reach the downstream slope of the dam and cause a flash flood. The created flood is massive and powerful enough (above 400m^3/s surplus to the normal discharge, flowing at 40km/h) to cause significant consequences on the downstream valley. Furthermore, during the piping process the flow would erode loose deposits and debris, forming a hyper-concentrated flow, whereby the volume of water will progressively represent only one fifth of the total flow volume, before being diluted in the confluence forming the Bartang River.

2. Flood routing analysis

2.1. Flood routing simulation
Given the hazard described above, a flood routing simulation has been performed by Stucky Ltd, on the basis of which the early warning system and evacuation plan were designed. The objective was to identify the potential effects of a sudden flood in the Bartang Valley due to a sudden discharge from Usoi dam. Therefore two flooding scenarios were simulated, where respectively 1000 m3/s and 5000 m3/s were reached in a period of six hours. These parameters were selected to simulate the timing of a piping event in Usoi dam, where the flood is initiated by a disturbance in the internal structure of the dam following an earthquake. The simulation was conducted using the St Venant hydrodynamic equation, with an approximated terrain roughness of 25 (Droz 2007).

According to simulation results, an increase of 10% in the flow rate can be expected in Barchidiv 1.5 hours after the beginning of the phenomenon. The maximum discharge is expected after 6.5 hours, which corresponds to 30 minutes after the peak source discharge is reached at the dam. Therefore, if we take into account the 20km distance separating Barchidiv from Usoi dam, we can expect an average flood velocity of 40km/h¹.

Water height (h)      or   Water height x velocity (h x v)   Intensity level
h < 0.5 m                  h x v < 0.5 m2/s                  low
0.5 m < h < 2 m            0.5 m2/s < h x v < 2 m2/s         medium
h > 2 m                    h x v > 2 m2/s                    high

Fig. 11. Swiss flood risk intensity standards. [7]

2.2. Flood hazard mapping
Given the flood routing simulation results, flood hazard intensity maps have been established by Stucky Ltd using the Swiss standards (fig 11), which take into account the level and velocity of the flow according to the table in Fig. 11.

While the two uppermost villages of Barchidiv and Nisur are partly located in low to medium danger zones, the hazard intensity mapping expects most of the other villages of the Bartang valley to be entirely located in high risk zones for floods from 1000 m3/s. Furthermore, considering the fact that in the absence of an efficient risk mitigation system, an estimated 5000 human casualties would add up to the 4000 houses, 44 schools, 180 community hospitals and 18000ha of cultivated land that would be destroyed in the Bartang valley as a result of such floods from lake Sarez (Droz 2007), the implementation of a risk mitigation strategy that would involve the evacuation of all the villages of the Bartang valley is considered a necessity.

Fig. 12. 5000m3/s flood risk intensity mapping of the village of Barchidiv. [7]

¹ With the assumption that the flood is water. If the fact that the flow is a hyper-concentrated mud flood is taken into account, the velocity is much lower (but the flood much more destructive).
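
The Swiss intensity thresholds of Fig. 11 can be expressed as a small classification function. The sketch below is an illustration, not the procedure used by Stucky Ltd: the function name is hypothetical, the two criteria are combined by keeping the more severe of the two levels, and values falling exactly on a threshold are assigned to the upper class. Both choices are assumptions, since the table does not specify them.

```python
LEVELS = ("low", "medium", "high")


def _level_index(value: float, lower: float = 0.5, upper: float = 2.0) -> int:
    """Map a value onto 0 (low), 1 (medium) or 2 (high) using the Fig. 11 thresholds."""
    if value >= upper:
        return 2
    if value >= lower:
        return 1
    return 0


def classify_intensity(height_m: float, velocity_m_s: float) -> str:
    """Classify flood intensity from water height (m) and flow velocity (m/s).

    Assumption: the overall level is the more severe of the height criterion
    and the height-times-velocity criterion (m2/s).
    """
    if height_m < 0 or velocity_m_s < 0:
        raise ValueError("height and velocity must be non-negative")
    height_times_velocity = height_m * velocity_m_s
    return LEVELS[max(_level_index(height_m), _level_index(height_times_velocity))]


# 40 km/h is the average flood velocity quoted in the text; the 1 m depth is hypothetical.
velocity_m_s = 40 / 3.6
print(classify_intensity(1.0, velocity_m_s))
```

With the quoted velocity, even a shallow flow is classified by the height-times-velocity criterion rather than by depth alone, which is consistent with most of the valley being mapped as a high intensity zone.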


2.3. Quantifying the demand
The parameters resulting from the flood routing simulation described above enable the estimation of the demand to be met by an effective risk mitigation system in the Bartang valley. The mitigation strategy that is considered in the case of lake Sarez being an early warning system coupled to a population evacuation plan, the demand parameter to be considered is the evacuation time (i.e. the time available to alert and evacuate the population in a given village). The evacuation time being shortest due to their location, the demand is highest on the villages of Barchidiv and Nisur, which will thus henceforth be considered critical and focused on in this analysis.

The village of Barchidiv is located 20 km downstream of lake Sarez and has 30 households with 186 inhabitants. According to the assessment of the geologists, a flash flood from Lake Sarez would reach the village in 23 to 27 minutes. In the case of the village of Nisur, a flood would reach the 40 households (242 inhabitants) of the village in a period of 31 to 38 minutes after being generated at the lake, 29 km upstream (Gayosov, quoted by Karim 2008).

Considering the fact that the uppermost flood detectors (MU7) are located 10 km downstream of the lake, the time between the moment the flood can be detected by the system and the moment it passes in the village of Barchidiv is 12 to 14 minutes, and 16 to 19 minutes for the village of Nisur (ibid).

Therefore, an evacuation time of 12 to 14 minutes in the village of Barchidiv is the critical demand to be considered in the present system. A failure to meet this demand is thus considered as a sufficient failure condition.

B Capacity

1. Early Warning System

1.1. Baseline Procedure
The standard procedure planned for the system to manage the baseline scenario described in the preceding section is given as follows (Droz 2007).

The earthquake is detected by the strong motion accelerometers in both the measuring units on the lake shore (MU2) and at the dam toe (MU8), and the level 1 warning is transmitted by satellite to the SCADA at Dushanbe, which must ask for visual confirmation by radio at the dam house. Once the earthquake is confirmed, a visual inspection of the dam is conducted from the dam house and awareness is increased to detect discharge drops, flash floods and high waves.

The strong discharge drop (of about half the normal discharge) that could be caused by the dam toe instability and the internal disturbance can then be detected by visual observation and/or the river gauging system located in the village of Barchidiv (MU9), about 16 km downstream from the dam. The detection of the discharge drop issues a level one warning to the SCADA at Dushanbe. The awareness is increased towards the detection of a flash flood.

Assuming it occurs less than two hours after the detection of the discharge drop, the flash flood is detected by the MU7 flood sensors located about 10 km upstream of Barchidiv at a flow rate of 400m3/s (level 1 alarm). If a flow rate superior to 2000m3/s is detected, a level three alarm is automatically given to both the SCADA and the villages within two minutes. The evacuation alarm is thus triggered in the villages about 10 to 12 minutes before the flood reaches Barchidiv.

If the flash flood had occurred more than two hours after the discharge drop is detected, the villages would already have been evacuated, as a level three evacuation alarm is automatically issued by MU9 once the discharge drop has lasted two hours.

1.2. Crisis Situations
Several crisis situations with different crisis triggering events have been generated on the basis of the concerns identified in the several documentation sources (Droz 2008, El Hanbali 2007, Bea 2008 personal communication).

- The first three situations are caused by direct and indirect (i.e. maintenance related) human violations, which can be linked to existing concerns on the future supply of funding and trained staff.

Situation 1
Due to insufficient funding and out-migration of trained staff, a lack of qualified personnel to operate the system on a continuous basis is to be noticed. Therefore, the currently sufficiently trained operating personnel has to be working on longer shifts, which raises their level of stress and fatigue.
The earthquake occurs around 4 a.m., and neither the SCADA operator nor the personnel at the dam house are able to react appropriately because of stress, fatigue and a concomitant health problem (say a strong diarrhea). As a result, as the baseline natural hazard scenario unfolds, no human operator is available to manage the system.

Situation 2
Due to the remoteness of their location and their difficulty of access, no maintenance operation has been performed on measuring units 7 (flood sensor), 4 and 8 (strong motion accelerometers) during the past winter season, which has been particularly harsh.
As a result, at the occurrence of the earthquake in mid April, the flood sensors and strong motion accelerometers have not been accessed for either routine maintenance or testing since mid October.

Situation 3
Due to the decrease in funding and trained staff, the overall global maintenance quality of the


system has decreased dramatically, except for the alarm units in the villages, which are maintained in a satisfactory way because of the strong local involvement.
As a result, at the occurrence of the earthquake, no maintenance operation has been performed on any part of the system (except the village alarm units) for a period of two years.

- The next three situations are caused by unpredicted unpredictable events whose occurrence has a relatively low probability and high consequences. Such features have been identified to be typical of crisis triggering events (Bea 2008).

Situation 4
Due to a huge snowstorm, all access to and from the dam house is cut for two weeks at the occurrence of the earthquake. As well, the outside visibility is reduced to less than 10 meters for the whole period.
As a result, no site access or visual observation can be made for a period of two weeks following the earthquake.

Situation 5
A stone avalanche triggered by the earthquake destroys completely measuring unit 7. As a result, the flood sensors at MU7 are not operational at the occurrence of the flash flood.

Situation 6
A snow avalanche triggered by the earthquake destroys completely the dam house.

- Finally, the two last situations illustrate specific concerns found in the literature (Schuster 2004, Papyrin 2007) about the reliability of the lake Sarez early warning and monitoring system.

Situation 7
Due to global warming, the hydro geological setting of the area changes. Specifically, the melting of the permafrost and the progressive rise of the water level due to glacier melting may change the natural hazard probabilities and decrease the capacity of the early warning system to mitigate the considered scenario.

Situation 8
A massive landslide on the unstable right bank slope is triggered by the earthquake and generates a massive tsunami that would not overtop the dam, but destroy the dam house.

1.3. Crisis selection
Validation was sought on the eight situations described above by submitting them to an expert’s advice, who was to qualitatively evaluate their probability of occurrence and consequences in terms of human lives at risk, by describing them as “low”, “medium” and “high”. Both Kadam Maskaev (the current chief operator) and Patrice Droz (the system designer) were contacted and requested to evaluate the situations, yet only Mr. Droz responded positively. The following section is thus solely based on Mr. Droz’s knowledge of the system and its environment, as its designer.

Situation 1
The first situation involves the activation of a level one emergency at the event of an earthquake. The procedure does not require any direct and immediate operator intervention other than a visual evaluation of the earthquake impacts, conducted at dawn. If a flash flood happened to occur before dawn, the flash flood detection and alarm system is fully automatic and operational if it is not destroyed during the earthquake (situation 5). Therefore, although the probability of an earthquake occurring at night in the shift of an insufficiently trained operator is considered as “medium”, its consequences are considered “low”.

Situations 2, 3 and 5
Situations two, three and five all involve a malfunction of the MU7 flood detector, either because of a lack of maintenance or its destruction during the earthquake. According to the expert, there is a “medium” probability of a physical destruction of MU7 due to consequences of the earthquake (say a stone avalanche). Yet, due to the difficulty of access of the device, combined with a decrease in funding and staff training, the probability of a malfunction of MU7 due to a lack of maintenance (e.g. typically a failure to periodically check its power supply) is considered “high”. The flood sensors located at MU7 are the only flash flood detection devices upstream from the village of Barchidiv and are thus a capital component to the system’s ability to detect the flood early enough to allow a safe evacuation of the valley. Yet, there is an advantageous configuration and a proper correlation with the MU9 measuring unit located in Barchidiv that would allow the evacuation of most of the downstream population. However, if the proper crisis sequence unfolds, the villages of Barchidiv and Nisur may not be evacuated in time. This point will be pursued in a following section. The consequences of a malfunction of MU7 are thus considered as “medium”.

Situation 4
The punctual lack of accessibility/visibility displayed in situation four, although likely to happen in this elevated and mountainous area, does not compromise the direct efficiency of the evacuation alarm triggering system, which is entirely automatic and independent of the need of visual confirmation in the case of a flash flood. Yet it may influence the evacuation time of the population and delay the rescue and/or observation missions. This point will be considered in the assessment of the evacuation plan conducted in a later section. On a side note, according to Mr. Droz, the only MI8 helicopter in Tajikistan that could transport 15 people crashed in March 2008, which may temporarily complicate the emergency access of the area. The probability of such a situation to occur is

Human & Organization Factors: Quality & Reliability of Engineered Systems-Term Project CE290/2008/14

thus considered “high”, while the consequences on the ability of the system to detect the danger and transmit the alarm are considered “low”.

Situations 6 and 8
Situations six and eight involve the complete destruction of the dam house. The probability of occurrence of situation six, which involves a snow avalanche, is considered very close to zero by the expert, due to the chosen location of the house. Situation eight involves the destruction of the dam house due to a tsunami in Lake Sarez. The very occurrence of the tsunami due to a massive landslide of the unstable right bank slope of the lake in the event of an earthquake has been considered “very unlikely” by the panel of experts in charge of the risk survey undertaken in the frame of the LSRMP. Yet if the tsunami does occur, the destruction of the dam house is possible. Indeed, the dam house was built by the Usoi department at an altitude 30 meters lower than advised by Stucky Ltd, to “spare their men”.

Situation 7
Finally, situation seven involves possible effects of global warming on the region, namely a melting of the permafrost and an increase in river water levels due to glacier melting. According to the expert, the melting of the permafrost will merely produce superficial landslides and local block collapses, but no important large-scale motions. Moreover, a hydrological survey established in 2001 has shown that despite the global warming, no important modification in the natural flow rates in the area has occurred in the last couple of years. Therefore, despite the quasi certainty of global warming, its consequences on the early warning system of Lake Sarez are considered “low”.

The expert opinion on the effects of the considered crises on the system gives important insights on its functioning and behavior. The qualitative risk evaluation linked to each situation is summarized in figure 13. According to the expert, the riskiest situations concerning the alarm system are situations two, three and five, which all involve a malfunction of the main flood sensors 10 km upstream from Barchidiv (MU7). The malfunction of MU7 will thus be considered as the selected crisis-triggering situation to be further investigated in the following section.

Situation | Description | Likelihood | Consequences
1 | Earthquake at night on unprepared staff | Medium | Low
2 | MU7 flood sensors (FS) malfunction due to physical inaccessibility for maintenance | High | Medium
3 | MU7 FS malfunction due to long term decrease in maintenance quality | High | Medium
4 | No site access and no visibility | High | Low
5 | MU7 FS destruction resulting from earthquake | Medium | Medium
6 | Dam house destruction resulting from snow avalanche | Low | Low
7 | Global warming effects | Medium | Low
8 | Dam house destruction resulting from a tsunami | Low | Low

Fig. 13. Expert situation selection.
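
The selection summarized in figure 13 can be illustrated with a simple ordinal scoring. The sketch below is an illustration only: the numeric scale (Low = 1, Medium = 2, High = 3) and the product rule are assumptions introduced here, whereas the expert assessment itself was purely qualitative.

```python
# Ordinal scale for the qualitative labels (assumed here, not given by the expert).
SCORE = {"Low": 1, "Medium": 2, "High": 3}

# (likelihood, consequences) per situation, as listed in figure 13.
SITUATIONS = {
    1: ("Medium", "Low"),
    2: ("High", "Medium"),
    3: ("High", "Medium"),
    4: ("High", "Low"),
    5: ("Medium", "Medium"),
    6: ("Low", "Low"),
    7: ("Medium", "Low"),
    8: ("Low", "Low"),
}


def risk_score(likelihood: str, consequences: str) -> int:
    """Risk as the product of the two ordinal scores (an assumed rule)."""
    return SCORE[likelihood] * SCORE[consequences]


def rank(situations: dict) -> list:
    """Situation numbers sorted by decreasing risk, ties broken by number."""
    return sorted(situations, key=lambda s: (-risk_score(*situations[s]), s))


if __name__ == "__main__":
    for number in rank(SITUATIONS):
        print(number, SITUATIONS[number], risk_score(*SITUATIONS[number]))
```

Under these assumptions, situations two, three and five come first in the ranking, which agrees with the expert's selection of the MU7 malfunction as the riskiest crisis-triggering situation.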

1.4. Crisis Analysis
A malfunction of MU7 having been identified as the riskiest crisis-triggering situation by the expert assessment presented above, the aim of the following paragraph is to perform a deeper estimation of the effect of such a crisis on the functionality of the early warning system.
A quantitative likelihood estimation of such a crisis was provided by Mr. Droz (2008). The likelihood of the destruction of MU7 as a consequence of the earthquake was estimated at 5%, while the likelihood of a malfunction due to a lack of maintenance was estimated as high as 30%. These crisis likelihoods are high enough to deserve attention.
However, the effects of a MU7 malfunction are mitigated by the presence of the downstream measuring unit MU9. Indeed, MU9 is very likely to be working properly because of its location in the village of Barchidiv, which facilitates its access for maintenance and decreases the probability of it being destroyed in the aftermath of the earthquake (the villages being generally located in “safer” zones). This measuring unit is equipped with a gauging unit to detect the significant flow rate decrease that precedes a piping failure flash flood. In addition, and similarly to MU7, MU9 is equipped with automatic flood sensors. Therefore, if the flash flood occurs after a period of two hours of significant flow rate decrease, an evacuation alarm will be automatically triggered by MU9. The response to such a scenario would thus not be affected by a malfunction of MU7.
Yet if the flash flood occurs within two hours after the flow rate decreases, it will only be automatically detected when a 2000 m3/s flow rate passes through the village of Barchidiv. According to Mr. Droz, the system has been designed to trigger the evacuation alarm within 2 minutes from the moment the flood is detected. Therefore, in the worst-case scenario where the flood is only detected at MU9, the evacuation is triggered about 15 minutes after the flood has passed MU7 (i.e. the fifteen minutes needed for the flood to cover the 10 km separating MU7 and MU9). If no proper action were taken to decrease this 15-minute delay in

detection time, the two uppermost villages of Barchidiv and Nisur would thus be threatened by a lack of sufficient evacuation time. Mr. Droz estimates the consequences of this delay in terms of human lives at about ten casualties. This low number can probably be explained by the fact that both Barchidiv and Nisur are mostly located in low to medium risk zones, as revealed by the risk mapping established from the flood routing simulations. Moreover, according to Mr. Droz the villagers are likely to spontaneously evacuate to the safe havens in the event of an earthquake. This assertion will be investigated in a following section concerning the evacuation plan.

1.5. Crisis management suggestion
However, given the configuration of the early warning system, I argue that the baseline delay imposed by the automatic reaction time of 15 minutes described above can be decreased through a proper management of a MU7 malfunction crisis. Such a crisis management involves a proper implementation by the SCADA operating crew of what Reason (1990, quoted by Bea 2008) describes as an Observe, Orient, Decide and Act (OODA) loop. In the case of the described crisis, an efficient OODA loop can be described as follows:

- Observe:
The goal of the observation phase is to detect the symptoms and abnormal system behavior that reveal the presence of a crisis. In the case of the considered crisis, a level 1 (400 m3/s) flood alarm from MU9 without the corresponding alarm from the upstream MU7 flood sensors is a possible early sign of the crisis induced by a malfunction of MU7.

- Orient:
The goal of the orientation phase is to establish and confirm a causality link between the observed symptoms and the crisis-triggering event. In the considered crisis, the malfunction of MU7 has to be inferred by the operating crew on the basis of the absence of the expected level one warning. This inference must ultimately be confirmed by launching a (distant) MU7 testing protocol.

- Decide:
The goal of the decision phase is to take the appropriate decision on the basis of the previously established causality links. The outcomes of an appropriate decision are to mitigate the crisis and return the system to its normal functioning state. In the considered crisis, the decision to evacuate the potentially threatened villages must be taken. In the uncertainty of the actual occurrence of a flood (due to the acknowledged malfunction of MU7), such a decision may not be easy to take, knowing that a “false” evacuation would be dangerous, costly and could encourage a crying wolf effect (i.e. decrease the awareness of the population in the event of future alarms). On the other side, the results of not evacuating the villagers in the event of a flash flood are obvious.

- Act:
The decision taken in the previous step must then be implemented, which in the considered case consists in manually triggering from the SCADA the level three alarms in the threatened villages.

- Observe:
Finally, the results of the crisis management strategies must be observed and the decision taken to reiterate the OODA loop if the crisis is not solved. In the considered case, a feedback and visual confirmation from the dam house staff and/or the evacuated population are expected.

If the OODA loop is successfully conducted, an evacuation alarm can be given shortly after a reasonable level one 400 m3/s flow rate is detected in Barchidiv, which, depending on the hydrograph of the flood, would leave enough time for a successful evacuation of the threatened villages.
This management strategy has been submitted to and validated by Mr. Droz (2008), with the remark that its efficient implementation would be compromised in the event of a too sharp hydrograph peak.

2. Evacuation plan

2.1. Baseline performance
In addition to triggering the alarms, the successful mitigation of the natural hazard described in the baseline scenario involves the efficient evacuation of the villages. The evacuation plan consists of the rapid evacuation of the villagers towards equipped and maintained safety havens in the nearby mountains once the local alarm sirens are activated. A “Disaster Response Team” is nominated in each village to organize and coordinate these evacuations, as well as to supply and maintain the safety havens. This is done in collaboration with FOCUS humanitarian, an NGO that works to raise awareness and provide training among the communities. In that frame, evacuation exercises were ordered in April 2007 by the Usoi department (the LSRMP management authority) in the most critical villages of Barchidiv and Nisur. According to FOCUS humanitarian (Karim 2008, personal correspondence), an evacuation drill was conducted in the village of Barchidiv on April 19th 2007 and yielded an evacuation time of ten minutes, in which the population of 186 successfully transferred to the safety haven located 500 meters above the village. On the next day in the downstream village of Nisur, 17 minutes were required for the 242 inhabitants to cover the 800 meters to the designated safety haven.
Assuming the alarm is given by the early warning system within two minutes after the flood is detected at MU7 (see section II.B.1.1), these evacuation times are barely within the totals of 12 to 14 minutes and 16 to 19 minutes respectively imposed by the flash flood demand on the villages of Barchidiv and Nisur (see section II.A.2.2). As a matter of fact, in the case of Nisur, the total capacity performance of


19 minutes (i.e. two minutes detection time and 17 minutes evacuation) does not meet the demand.
Furthermore, it is important to mention that the evacuation exercises have been conducted in favorable conditions (by day and in spring) and on a warned population, to prevent “shocking” and “excessive stress” on the local population (Karim 2008, personal correspondence). Yet several authors have shown (UrbanikII 2000, Duclos 1987, Aboelata 2004, Graham 1999, Sime 1995, Colonna 2001) that the evacuation time performance of a community is shaped by numerous factors that are not taken into account if the exercise is conducted on a warned population. Such factors include the level of local awareness and compliance, the isolation of the community in a crisis situation, environmental factors such as night time and bad weather, the lack of sufficient warning time, and the congestion of the evacuation paths due to panic, confusion and the lack of excess capacity of the local communication system.
Thus, given the limited excess capacity in the population evacuation performance times, and taking into account the evacuation performance shaping factors mentioned in the literature, several crisis situations were generated, whereby the ability of the evacuation plan to meet the required demand may be compromised due to supplementary unfavorable factors added to the baseline exercise conditions in which the evacuation time performances were measured.

2.2. Crisis situations
According to previous reports (Palmieri 1999) and thanks to the awareness raising work of local NGOs among the population, the lack of local knowledge and perceived threat that has been mentioned in the literature (Duclos 1987, Colonna 2001) as key evacuation time performance shaping factors are not of concern in the present case. Moreover, the low population density of the area also decreases the concern of evacuation delays due to the congestion of evacuation routes (Sime 1995). Yet several other factors may increase the evacuation time to a point where the demand imposed by the flood exceeds the capacity of the system to evacuate the population in time.
Thus, several potential crisis-triggering situations were generated on the basis of the factors that were found in the literature to affect the population evacuation time. The relevance of each scenario was then tested through its submission to an expert’s judgment. Being the FOCUS humanitarian staff member responsible for the LSRMP evacuation plan in the Bartang valley, Mr. Rahim Balsara was consulted here. He was asked to evaluate the likelihood (P()) of each scenario and its potential consequences (Cons) in terms of human lives lost. For each situation, his evaluation and comments are given in a following table.

- The first factor to consider is the surprise effect (Colonna 2001). As Colonna states, “With no announced warning, occupants might demonstrate behaviors that could be dangerous under actual emergency conditions”. Therefore, even without taking into consideration the other aggravating factors that will be mentioned in the following paragraph, the surprise factor alone may be of significant importance and thus a potential crisis-triggering situation:

Situation 1
The baseline scenario occurs without other aggravating factor than the fact that the evacuation occurs on a surprise basis.

Remarks
P() | Med | The surprise effect can be mitigated through the crisis mitigation strategy presented further down, where training in the detection of “natural” early signs will be suggested.
Cons | High | Consequences have been estimated by Mr. Balsara to delay the evacuation time from 10 to 15-17 minutes for Barchidiv, and from 17 to 20-25 minutes for Nisur.

Fig. 14. Situation 1

Mr. Balsara estimated the effect of the surprise factor on the evacuation time shown in figure 14, with the comment that actual trainings or drills have not yet been conducted during winter. The figures given above are thus rough estimates. Yet the fact is that the (top) values of 14 and 19 minutes, imposed by the demand on the villages of Barchidiv and Nisur respectively, are exceeded.

- The next two situations may be linked to the surprise factor as well, whereby subsequent panic or the lack of appropriate and available leadership may significantly affect the evacuation performance (Duclos 1987, Colonna 2001):

Situation 2
A general panic situation occurs among the population, as a result of the simultaneous occurrence of the earthquake and the flood sirens.


Remarks
P() | Low | Villagers are used to natural hazards around them and would most likely react and respond in a calm manner.
Cons | Low | Once the gravity of the situation is realized, evacuation times would not be impacted. These villagers have received trainings and general education related to these hazards as well as the alert system.

Fig. 15. Situation 2

Situation 3
Heavy casualties are to be deplored among the population due to the earthquake. As a result, the key members of the disaster response team in the villages are missing when the flood sirens ring.

Remarks
P() | Low | Members of Community Support Teams (CST) are spread across the villages and unlikely to be all affected at the same time.
Cons | Low | Dependence on CST for evacuation is fairly low as general community knowledge and trainings are at play during such a crisis.

Fig. 15. Situation 3

- Furthermore, the obvious effects of external environmental factors affecting orientation and movement (Graham 1999, UrbanikII 2000) are taken into account in the next two situations:

Situation 4
The principal evacuation routes are severely damaged due to snow or stone avalanches triggered by the earthquake.

Remarks
P() | Medium | Evacuation paths are designed to be safe from these natural hazards during the planning and design phase.
Cons | High | If this were the case, they would need to pursue alternate routes, which might delay reaching the destination.

Fig. 16. Situation 4

Situation 5
The baseline scenario occurs at night in a bad snowstorm. Therefore visibility is reduced below 10 meters in the whole area.

Remarks
P() | Low | Villagers are familiar with the routes and used to severe weather conditions.
Cons | Low | Less likely to impact evacuation times in a significant manner as they are familiar with these routes and weather conditions.

Fig. 17. Situation 5

- Finally, the criticality of providing a sufficient warning time for the population to evacuate (Graham 1999, Aboelata 2004) is illustrated in the last two situations, where the alarm could simply not be transmitted in time by the early warning system.

Situation 6
The earthquake destroys the alarm stations in the villages. As a result, no evacuation alarm is given.

Remarks
P() | Medium | These are wireless systems connected via satellite and less likely to be affected by on-the-ground damage.
Cons | High | If these systems do fail to function due to unforeseen circumstances, the evacuation times will be significantly longer, as the warnings would have to come from a neighboring village or from the actual event itself.

Fig. 18. Situation 6

Situation 7
Due to a decrease of the maintenance quality, the alarm equipment in the villages is not functioning.


Remarks
P() | High | System malfunction probability is high as these equipments are not managed and maintained by Committee of Emergency Situation, USOI department and villagers. Due to joint responsibility its upkeep, regular testing and functioning
Cons | High | In case of system failures, evacuation times will be significantly longer.

Fig. 19. Situation 7

The qualitative risk evaluation linked to each situation is summarized in figure 20. According to the expert, the highest risk with regard to the evacuation of the population can be found in situation 4, where the evacuation routes are made useless as a consequence of the earthquake, as well as in situations 6 and 7, where the alarm is not given at all. Although the possible occurrence of situation 4 would certainly be preoccupying enough to deserve further study, this paper will focus on the crisis triggered by the failure of the alarm system to order the evacuation of a village. Indeed, the malfunction likelihood of the early warning system (either at MU7 or at the village sirens level) has been considered high by both consulted experts (Droz 2008 and Balsara 2008). The importance of the potential consequences linked to the total absence of flash flood alarm in the Bartang Valley being obvious, the risk linked to such a crisis is thus maximal.

Situation | Description | Likelihood | Consequences
1 | Evacuation on a surprise basis | Med | High
2 | General panic | Low | Low
3 | No surviving leadership | Low | Low
4 | Evacuation path destroyed by the earthquake | Medium | High
5 | Very low visibility and extremely bad weather conditions | Low | Low
6 | Village alarm destroyed by the earthquake | Medium | High
7 | Village alarm not in function due to poor maintenance | High | High

Fig. 20. Expert situation selection.

2.3. Crisis Analysis
The risk evaluation by expert judgment has led to selecting the crisis-triggering situation where the village alarm system does not function properly and fails to issue the evacuation alarm before the passage of the flash flood. According to the expert, the likelihood of an alarm malfunction in the village of Barchidiv or Nisur is “very high”, that is, higher than 85%.
The malfunction of the village alarm units may result in the passage of a super concentrated flash flood through unwarned villages. The consequences of such a crisis in terms of human casualties are considered “medium” in Barchidiv but “high” in Nisur.
Yet, in a similar manner to the other crisis detailed earlier in this paper, I argue that a proper interactive management of the crisis can mitigate the consequences of these malfunctions. Indeed, given the failure of the engineered system to issue the expected early warning, the most obvious crisis management strategy would consist in searching for alternate warning signs. In that light, two possibilities have been identified and considered, in collaboration with Mr. Balsara (2008).

Reported alarm
The first strategy has been mentioned by Mr. Balsara as a response to one of the crisis-triggering situations mentioned in the previous section (situation 6), and consists of the evacuation message being transmitted from a neighboring village.
Considering the critical villages of Barchidiv and Nisur, and given the remoteness of the location and the available transportation means, the alarm message can be transmitted either via radio communication or by a running messenger. These possibilities have been submitted to the expert’s judgment, in order to evaluate their probability of occurrence and success. A warning is considered successful when the evacuation of the village is completed before the passage of the flash flood.
Given the distance and lack of quick communication means in an earthquake aftermath, the expert considered both the occurrence and the success of a warning transmitted via land communication negligible. However, the chances of occurrence and success of a radio warning were considered high (60%-85%). Thus, training the villagers to systematically transmit the warning signal to the neighboring village by radio can mitigate the risk linked to a malfunction of the alarm equipment.
However, this strategy is based on the assumption that the alarm malfunctions are not correlated among the villages. In other words, the occurrence of the alarm malfunction in a given village does not affect the likelihood of it occurring in a neighboring village as well. Yet the task of maintenance being currently in the responsibility of the same entity (the Usoi department) in all the villages, a decrease in the

quality of the maintenance would affect all the villages equally. The likelihood that a given alarm will not function is thus higher if the surrounding alarms are also not functioning due to a lack of maintenance.
Therefore, the reported alarm strategy would only present itself as an efficient crisis management option if the procedure of warning the surrounding villages by radio is formalized and the villagers subsequently trained. Furthermore, the task and responsibility of maintaining the village alarm equipment should be outsourced to the villagers themselves. That would decrease the correlation between the alarm malfunctions. Furthermore, attributing the alarm maintenance task to the people these alarms are designed to save may increase the likelihood that the maintenance is properly performed. Finally it would build local capacity and empower the communities, which was one of the main goals of the project as stated by the World Bank (see section I.C.3.2 and ElHanbali 2007).

Direct Observation
The second strategy has been suggested by Mr. Droz (2008), and is based on the capacity of the villagers to notice and react to the chain of natural events that would precede the flash flood in the baseline scenario. Namely, the life saving strategy consists in the reflex of the Community Support Teams in the villages to spontaneously take the appropriate decision to conduct a safety evacuation, once a strong earthquake and/or important flow rate variations in the river are observed.
A spontaneous decision of the villagers to evacuate the village shortly after the occurrence of the earthquake, and thus before the occurrence of the flash flood, would remove the negative consequences of a malfunction of the alarm system by giving more than enough time for a successful journey to the safety havens. Such a reaction is thus a priori desirable.
Yet, the occurrence and success of such a spontaneous decision depends on a chain of likelihoods that frame the decision making process within the community. In the same manner as in a preceding paragraph, these crisis management stages will be described here in the frame of Reason’s (1990) OODA loop:

- Observation: The first likelihood concerns the probability of these preceding phenomena to be observed and noticed by the villagers, despite possible difficult environmental conditions resulting in a low visibility. In the considered case, these phenomena are an earthquake and a strong river flow rate variation. The likelihood that the villagers would notice these events is considered “medium” (40%-60%) and “very high” (>85%) for the earthquake and the flow rate variation respectively. Yet, it has to be noted that these probabilities depend greatly on local conditions, mainly environmental, that could affect the detection ability of the villagers.
- Orientation: An inference must then be made between the detected phenomenon and the imminent threat of a flash flood. There is thus a certain likelihood that these events be properly interpreted by the Community Support Teams (CST), which may then issue an evacuation order. This likelihood is considered “high” (60%-80%) by the expert.
- Decision: The evacuation order from the CST must then be followed by the villagers, despite the absence of a tangible and unmistakable evacuation signal, such as alarm sirens. Therefore, the likelihood must be considered that the whole village will actually follow the evacuation order from their fellow villagers of the CST in the absence of an alarm. According to the expert, an issue can be found here in the absence of a formalized backup village alarm system that can be unmistakably activated and followed in case the main system fails to work. Such a system can be as simple as a village gong, but must be implemented.
- Action: The evacuation order must then be transmitted to the entire population of the village in a minimum time and, again, in the absence of an unmistakable alarm siren. However, if the entire population is reached and warned, the chances of a successful evacuation are “very high” (>85%) according to the expert.
- Observation: Finally, the observation of the effectiveness of the intended actions is critical. Specifically, in the present case, the piping phenomenon and the resulting flash flood can occur up to hours after the triggering earthquake. Therefore, it is critical that the entire population stays in the safety haven until the risk is confirmed to be decreased to an acceptable level.

If the OODA loop is successfully conducted, the villagers themselves can give an evacuation alarm shortly after the occurrence of the early phenomenon, that is, several hours before the occurrence of the flood. Furthermore, this warning mechanism would be totally unaffected by any of the hardware malfunctions of the system that are considered here. Finally, if we consider the OODA loops for the two considered warning signs (earthquake and flow rate variation) as a parallel configuration of two series systems, the overall chance of success estimated by the expert is “high”, which provides internal validation to this strategy.
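
The parallel configuration of two series systems mentioned above can be made explicit with a short calculation. The sketch below is only illustrative and rests on assumptions that are not in the expert assessment: each qualitative range is replaced by its midpoint (“very high” taken as the midpoint of 85%-100%), the stages are treated as independent, and the decision stage, for which no figure was given, is left as a parameter set to 1.0 by default.

```python
from math import prod

# Midpoints of the expert's qualitative ranges (the midpoint choice is an assumption).
MEDIUM = 0.50      # "medium", 40%-60%
HIGH = 0.70        # "high", 60%-80%
VERY_HIGH = 0.925  # "very high", >85%, taken as the midpoint of 85%-100%


def series(stages):
    """A series system succeeds only if every stage succeeds."""
    return prod(stages)


def parallel(branches):
    """A parallel system fails only if every branch fails."""
    return 1 - prod(1 - p for p in branches)


def direct_observation_success(decision=1.0):
    """Overall chance that at least one warning sign leads to an evacuation.

    `decision` is a hypothetical placeholder: the expert gave no figure for
    the decision stage, so 1.0 is an optimistic default, not a sourced value.
    """
    earthquake = series([MEDIUM, HIGH, decision, VERY_HIGH])
    flow_variation = series([VERY_HIGH, HIGH, decision, VERY_HIGH])
    return parallel([earthquake, flow_variation])


if __name__ == "__main__":
    print(f"overall success: {direct_observation_success():.2f}")
```

With these midpoints the combined value falls inside the 60%-80% band that the expert labels “high”. Since the orientation, decision and action stages are carried out by the same Community Support Team for both warning signs, the two branches are probably not independent, so the result is better read as an upper bound.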


III. Evaluation

A qualitative analysis of the robustness of the Lake Sarez Risk Mitigation Project has been undertaken in this work. The evaluation of the robustness of the LSRMP consisted in inquiring whether the capacity depicted by an expert assessment meets the demand in the considered situations. In the absence of a crisis, the baseline demand of respectively at least 12 and 16 minutes is barely met by the baseline capacity of respectively 12 and 19 minutes in the critical villages of Barchidiv and Nisur. However, knowing that most of both villages are located in relatively safe zones, the consequences of this lack of excess capacity have been estimated at less than 10 casualties (Droz 2008).
As a matter of fact, the LSRMP has led to a significant decrease of the risk in the Bartang valley, to orders of magnitude similar to those nowadays admitted for engineered dams (Droz 2007), which must be considered a success, given the remoteness and lack of infrastructure of the area. Furthermore, the effects of most of the considered potential crisis situations on the reliability of the LSRMP were considered “low” by the experts in terms of likelihood or consequences. This denotes a sufficient robustness of the system with regard to most of the considered crises, due to its configuration, correlation, and local excess capacity.
However, the expert assessments have also revealed that the probability of occurrence of the riskiest crisis-triggering situations is highly dependent on the quality of the maintenance of the infrastructure, for both the engineered early warning system and the population evacuation plan.
In that light, given (1) the non-negligible level of natural hazard risk linked to Lake Sarez (fig. 5) and (2) the expressed concerns about the future staffing and funding of the device, as well as the decrease in the quality of maintenance (section I.C.3.4 and expert assessments), a crisis similar to those that were analyzed in this paper is likely to happen in a medium to far term future, and will thus require an efficient interactive management as an absolute necessity to mitigate its effect.
In that light, the following recommendations are given as possible paths to address this issue and thus limit the related risk through an optimized crisis mitigation and management strategy.
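
The comparison between capacity and demand used in this evaluation reduces to simple arithmetic, sketched below. The figures are those quoted in this paper (two-minute alarm delay, drill times of 10 and 17 minutes, demand of 12 to 14 and 16 to 19 minutes, and Mr. Balsara's rough surprise estimates); the names and the data layout are hypothetical choices made for the illustration.

```python
from dataclasses import dataclass
from typing import Tuple

DETECTION_MIN = 2  # alarm issued within two minutes of flood detection


@dataclass(frozen=True)
class Village:
    name: str
    demand: Tuple[int, int]               # time available before the flood (min)
    drill_evacuation: int                 # time measured in the April 2007 drill (min)
    surprise_evacuation: Tuple[int, int]  # expert's rough estimate, unannounced (min)


BARCHIDIV = Village("Barchidiv", (12, 14), 10, (15, 17))
NISUR = Village("Nisur", (16, 19), 17, (20, 25))


def margins(village: Village, evacuation_min: int) -> Tuple[int, int]:
    """Excess capacity in minutes against the lower and upper demand bounds.

    A negative value means the evacuation ends after the flood arrives.
    """
    capacity = DETECTION_MIN + evacuation_min
    low, high = village.demand
    return low - capacity, high - capacity


if __name__ == "__main__":
    for v in (BARCHIDIV, NISUR):
        print(v.name, "drill:", margins(v, v.drill_evacuation))
        print(v.name, "surprise (best case):", margins(v, v.surprise_evacuation[0]))
```

In the drill conditions the margins are zero or slightly negative against the lower demand bound, and even the most favorable surprise estimate leaves both villages with a negative margin against the upper bound, which is the lack of excess capacity discussed above.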

IV. Recommendations

The following recommendations were issued on the basis of the theories and methods drafted by Professor Robert Bea, from the University of California at Berkeley, in the frame of his research on interactive crisis management strategies. The fact that Professor Bea based his work on 500+ accident cases (Bea 2008) provides in itself an external validation to these recommendations. Additionally, an internal validation in the specific context of Lake Sarez will be sought through the submission of this paper to the local experts.

A. Proactive measures: towards a High Reliability Organization

The term “High Reliability Organization” (HRO) describes “organization that have operated nearly error free for a long period of time” (Bea 2008). Such organizations include nuclear aircraft carriers, nuclear power plants or air traffic control systems, the common characteristic of all these systems being the extreme consequences of the occurrence of a failure. In that light, and given that its very purpose is to be reliable enough to decrease the risk linked to the Usoi dam, it is my opinion that the LSRMP should strive towards being an HRO as a proactive measure to prevent crises.
This section is to underline HRO principles (Roberts 1990, Bea 2008) and characteristics that are particularly relevant to the LSRMP.

1. Safety oriented organizational culture
The first relevant common characteristic of HROs is to place safety at the very root of their organizational culture. In other words, such organizations are primarily preoccupied by failures and crises, at every level of management. Such a state of mind is crucial among the actors involved in the LSRMP, as it would allow a quick detection and remediation of to-be crises, as well as a more efficient management of crises that do occur. In that light, the decision of the Usoi department to build the dam house at a lower altitude than advised by the expert, in order to “spare their men” (Droz 2008), is a preoccupying detail. Indeed, although not necessarily significant in terms of direct risk increase, this decision may reveal a deeper and preoccupying mindset, where safety requirements are compromised for an economy of labor, even before the operational phase of the project.

2. Open communication and extensive process auditing
According to Merry (1998), a safety oriented organizational culture involves, among other things, ongoing safety performance measurements and openness of communications within the organization. Moreover, extensive process auditing procedures and actions are mentioned by Roberts and Libuser (1993) as key aspects of an HRO. Therefore, the absence of motivation of the operating authority to be audited is worrying at the very least. Indeed, despite numerous attempts, no relevant answer has been obtained from the Usoi department in the frame of this paper. Furthermore, Droz (2008) mentioned a decrease in the quality of the safety reports issued by the organization, as well as a

Human & Organization Factors: Quality & Reliability of Engineered Systems-Term Project CE290/2008/21

break in the communication between himself (the designer) and the current operators of the project.

3. Migrating decision making

Migrating decision making is mentioned by Roberts (1989) as another key aspect of HROs, whereby “authority is pushed to the lower levels of the organization [and…] decision making responsibility allowed to migrate to the persons with the most expertise to make the decision when the situation arises (employee empowerment)” (Bea 2008). Yet in the case of the Usoi Department, most operational decisions concerning maintenance (Droz 2008), real time management of crises (Droz 2008) and training (Karim 2008) are reported to be left to Kadam Maksaev, the Usoi Department head, in Dushanbe². Such a decision making scheme does not optimally allocate the available cognitive resources of the organization to improve its safety, and may eventually compromise its ability to efficiently resolve an occurring crisis.

Therefore, maintaining the reliability of the LSRMP on a long-term basis would involve a deep shift in the organizational mindset towards a safety oriented culture; a better match between decision making power, responsibility and relevant knowledge; and the restoration of communications with potentially useful parties outside the organization, coupled with an internal motivation to be audited. In addition, several other HRO characteristics could help decrease the likelihood of crises and optimize their management in the case of the LSRMP. Such principles include a commitment to resilience, an emphasis on training and selection, and a proper management of incentives, all of which are further described in the following paragraphs.

B. Interactive measures: staying awake!

The importance of properly managing a crisis in real time, in order to limit its effects and restore a proper functioning of the system, has been described at the beginning of this paper (section I.B). Considering the selected crisis situations, a specific interactive crisis management strategy has been suggested to minimize the time required to both warn and evacuate the population (sections II.B.1.5 and II.B.2.3). The goal of the following section is to walk through the OODA loops of both strategies to identify and summarize the main system characteristics necessary for their successful application.

1. Early Warning System Crisis Management

Observation

The symptom detection time depends on the proper functioning of the system’s sensorial abilities. In the population evacuation case, these sensorial abilities are the villagers themselves, who should be able to quickly detect any potential crisis-forecasting event in their environment. This goal is achieved by raising awareness through training and education. Fortunately, risk assessments (Droz 2007) have revealed that the trigger of any natural hazard linked to Lake Sarez is almost certainly a strong earthquake, which is rather obvious to detect. In the early warning system case, the detection time is influenced by the flood sensors at MU9, as well as the ability of the system to properly transmit the alarm to the operations headquarters in Dushanbe, by avoiding unfavorably correlated components (e.g. components in a parallel configuration that are threatened by the same source of risk). In the case of Lake Sarez this is achieved through the low correlation between the probabilities of failure of MU7 and MU9 (section II.B.1.5). However, the symptom detection time also depends on an uncontrollable characteristic of the “demand” on the system, that is, the morphology of the flood. Indeed, the strategy would only be effective for progressively increasing floods (Droz 2008), where sufficient time is available between the passage of a 400 m³/s flow rate and a level three 2000 m³/s flow. If the flood is too sudden, the available time is too short to effectively perform the prescribed actions.

Orientation

The orientation mainly depends on the ability of the involved persons to decode the symptoms and to attribute them to their actual cause or consequence. This ability is described by Weick (1995) as “sensemaking” and is a critical component of the success of the crisis management strategy.

Decision

The decision time depends on the ability of the responsible persons to apply a clear and efficient decision protocol, whereby the decision power in a given situation is attributed to the person with the highest ability to exercise it. This ability includes a deep enough experience and knowledge of the specific field context to be able to appropriately visualize the situation (i.e. “to step in the victim’s shoes”), the ability to generate and browse numerous decision options, and the ability to instantly be aware of the probable consequences of any possible solution. In other words, the decision maker must be able to exercise empathy, improvisation and mental simulation in a stressful and time limited situation (Bea 2008).

² Which is located more than 400 km from the Bartang Valley. Furthermore, the only available helicopter was reported to have crashed in May 2008 (Droz 2008).
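The observation step of the early warning loop relies on MU7 and MU9 not failing together. As a rough illustration only (it is not part of the experts’ assessment), the following Python sketch shows how the probability that two redundant components fail simultaneously grows with the correlation between their failure events. The function name and the numerical values are hypothetical placeholders, not figures from the LSRMP risk analysis.

```python
import math


def joint_failure_probability(p_a: float, p_b: float, rho: float) -> float:
    """Return P(A fails and B fails) for two redundant components.

    p_a, p_b -- marginal failure probabilities (hypothetical inputs)
    rho      -- Pearson correlation between the two binary failure events
    """
    for p in (p_a, p_b):
        if not 0.0 <= p <= 1.0:
            raise ValueError("failure probabilities must lie in [0, 1]")
    # Covariance of two Bernoulli variables with correlation rho.
    covariance = rho * math.sqrt(p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    joint = p_a * p_b + covariance
    # A joint probability can never leave the Frechet bounds.
    lower = max(0.0, p_a + p_b - 1.0)
    upper = min(p_a, p_b)
    if joint < lower - 1e-12 or joint > upper + 1e-12:
        raise ValueError("rho is not attainable for these marginals")
    return min(max(joint, lower), upper)


# Hypothetical marginal failure probabilities, for illustration only.
P_MU7 = 0.05
P_MU9 = 0.05

for rho in (0.0, 0.5, 1.0):
    p_both = joint_failure_probability(P_MU7, P_MU9, rho)
    print(f"rho = {rho:.1f}: P(MU7 and MU9 both fail) = {p_both:.4f}")
```

With independent failures the redundancy is fully effective, whereas with perfectly correlated failures the pair is no more reliable than a single unit, which is why components threatened by the same source of risk are to be avoided.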


Action

Finally, the minimization of the action time requires good coordination within the communities and the crisis management crew, as well as the system’s ability to efficiently transmit information to the village alarms.

Most importantly, one can notice that the success of the described strategy mainly depends on human and organizational qualities. It is thus once again the responsibility of the top management authority to sustain these qualities by nourishing a safety-based organizational culture that includes what Weick et al (1998) call a “commitment to resilience”, which involves a formal support for improvisation in crisis situations, and the active building of the organization’s cognitive resources. Yet this critical aspect can only be achieved through an adapted training and selection policy.

2. Population Evacuation Crisis Management

The strategy suggested in the following section is based on the several strategies suggested by or submitted to the experts (Mr. Balsara and Mr. Droz). It consists in achieving the goal of evacuating the villages at risk in time through a triple barrier strategy. The first barrier is a spontaneous evacuation of the population upon observation of natural early warning signs. If the villagers fail to evacuate, the second barrier is the siren alarm from the engineered warning system. If the villagers still fail to evacuate due to a system malfunction, the last barrier is a reported alarm via radio communication from a neighboring village.

2.1. Spontaneous evacuation

Despite the presence of an engineered early warning system, I argue that the success of the evacuation plan must primarily be based on the evacuees themselves, for the following reasons. First, I believe that their own security must ultimately rely on themselves, rather than on an unfamiliar and complex engineered system that is far from being infallible, as the experts’ assessments have shown. Secondly, I believe that the extremely valuable capital of local knowledge among the villagers must be used to mitigate their risk situation. Finally, the possibility that the villagers would spontaneously evacuate towards the mountains at the occurrence of a major natural hazard has been considered by both Mr. Droz and Mr. Balsara. Therefore, the training and formalization of such a process through the OODA loop described in section II.B.2.3 is likely to increase its effectiveness. Each step of this loop is considered in the following paragraph, with specific implementation recommendations that were issued in collaboration with Mr. Balsara.

- Observation:
The local capacity to observe and notice specific early warning signs may be increased through training. According to Mr. Balsara, it is important that the training be adapted to the multiple layers of the population, including an awareness campaign for children in schools. Yet, most importantly, it must keep the people alert to natural signs by helping them avoid the pitfalls of overconfidence (i.e. being “used to the danger”) and over-reliance on the technology alone.

- Orientation:
The level of awareness of the Community Support Teams (CST) and their ability to interpret and react to early natural signs must be ensured through frequent checks and testing by an outside implementing agency (e.g. FOCUS Humanitarian or the Usoi Department).

- Decision:
The capacity of the CST to decide on an evacuation in a crisis situation must be tested and improved through frequent and targeted drills and training.

- Action:
Finally, a formalized protocol must be established to alert the whole village in case of an alarm malfunction. Such a protocol may involve the use of traditional alarm techniques such as a village gong.

2.2. Village alarm system

The second barrier relies on the capability of the early warning system to detect the danger and trigger the village alarm systems. This system and the linked crisis management strategies and OODA loop have been described and detailed in sections II.B.1 and IV.B.1.

2.3. Reported alarms

Finally, should the first two evacuation strategies fail, the ultimate crisis management strategy on which to rely is the collaboration among the villagers in a crisis situation (Balsara 2008). This solution was investigated in section II.B.2.3. The two principal elements that were found to potentially increase the likelihood of success of the strategy are summarized here:
- The reflex to warn the neighboring villagers upon reception of a level 3 evacuation alarm must be formalized and specifically trained. Moreover, villages must be provided with efficient and adapted inter-village alarm transmission means. Such means may include radio communications, but also more traditional devices (e.g. visual signals) in the event of a radio malfunction.
- Finally, responsibility for maintaining the elements of the early warning system that are within the village must be given to the villagers. In addition to decreasing the correlation between the village alarm systems, it would perhaps increase the maintenance level of the devices, given the villagers’ obvious motivation. Finally, it would increase the building and valuation of local capacity.
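The logic of the triple barrier strategy described in this section can be made concrete with a small numerical sketch. Assuming, for illustration only, that the three barriers fail independently, the probability that a village is not evacuated is the product of the individual barrier failure probabilities. The class name, function name and all probabilities below are hypothetical and are not drawn from the experts’ assessment.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Barrier:
    name: str
    p_fail: float  # hypothetical probability that this barrier fails


def evacuation_failure_probability(barriers: Iterable[Barrier]) -> float:
    """Probability that every barrier fails, assuming independent barriers."""
    p_all_fail = 1.0
    for barrier in barriers:
        if not 0.0 <= barrier.p_fail <= 1.0:
            raise ValueError(f"invalid probability for {barrier.name}")
        p_all_fail *= barrier.p_fail
    return p_all_fail


# Illustrative values only; none of these figures come from the paper.
BASELINE = [
    Barrier("spontaneous evacuation", 0.30),
    Barrier("village alarm system", 0.10),
    Barrier("reported alarm from a neighboring village", 0.20),
]

# Same barriers after hypothetical training of the first barrier.
AFTER_TRAINING = [
    Barrier("spontaneous evacuation", 0.10),
    Barrier("village alarm system", 0.10),
    Barrier("reported alarm from a neighboring village", 0.20),
]

print(f"baseline:       {evacuation_failure_probability(BASELINE):.4f}")
print(f"after training: {evacuation_failure_probability(AFTER_TRAINING):.4f}")
```

The independence assumption is optimistic: a strong earthquake could damage sirens and radios at the same time, in which case the product underestimates the real failure probability. This is consistent with the recommendation above to decrease the correlation between the village alarm systems.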


C. Training: Empowering the human factor

Maintaining awareness, sensemaking, empathy, improvisation, mental simulation and coordination capabilities among the human components of the system requires the active sustaining of the organization’s cognitive resources, which must involve training and selection. Moreover, both consulted experts have mentioned training and proper staff selection as the key measures needed to sustain the long-term reliability of the system. Therefore, although the consulted experts have shown no concerns about the current preparedness level of both the operators and the local population, the following section addresses the issue of training, since “the cognitive skills developed for crisis management degrade rapidly if they are not maintained and used” (Bea 2008). According to Bea (2008), training should occur on three levels:

1. Normal Situations

First, the operators should be trained in the system’s normal operation in order to bring all the commonly performed tasks into their skill-based cognitive realm. In other words, the result of the training must be a skill-based performance of the routine tasks that does not mobilize excessive cognitive resources. In the case of Lake Sarez, training for normal operations involves the proper maintenance of the hardware and infrastructure parts of the system, including the critical maintenance operations on the flood sensors and the village alarm stations.

2. Abnormal Situations

Secondly, operators should be trained to handle abnormal situations, in which known but unexpected threatening events are imposed. Such training is to enable the prescribed restoration procedure to be performed on a rule-based basis, rather than on a slower knowledge basis. As a result, the OODA loop reasoning, which would occur in the management of an untrained situation, could be cut short to bypass the orientation and decision stages³. In the case of the early warning system, such a trained rule would include the direct launch of a testing procedure on MU7 if a level one flood alarm were given by MU9 alone. Furthermore, the local population can be trained toward rule-based reasoning through multiple evacuation exercises and a perfect knowledge of the evacuation procedure.

3. Unbelievable situations

Finally, training for extraordinary “unknown unknowable” situations is crucial to sharpen the people’s cognitive skills so that they react properly when real time crisis management is needed. Indeed, being exposed to unexpected and unforeseeable situations is the best manner to train the cognitive qualities that are mentioned above to be critical to accelerate the OODA loop sequence. In the case of the early warning system, such training may for example be focused on the need to manually trigger the evacuation alarm, in a high cognitive stress situation, as soon as major irreversible malfunctions are detected and confirmed (such as the malfunction of MU7 in the example given above). Furthermore, training for unexpected situations should also be applied to the local population, through unexpected evacuation drills and the building of the reflex to flee towards the safe haven at any serious suspicion of a threatening event. The National Fire Protection Association (NFPA) Life Safety Code (2001) states that “Fire is always unexpected. If the drill is always held in the same way at the same time, it loses much of its value. When, for some reason during an actual fire, it is not possible to follow the usual routine of the emergency egress and relocation drill to which occupants have become accustomed, confusion and panic might ensue. Drills should be carefully planned to simulate actual fire conditions.” Due to their unexpected nature and the necessity to evacuate the population in a limited given time, crises on the LSRMP must be managed in a similar manner. Yet, as mentioned in a previous section, the Usoi Department applies the policy of not conducting any unexpected evacuation drill on the local population, precisely to avoid the panic and emotion that would sharpen the cognitive capabilities of the people.

In addition to the crucial importance of training, a proper staff selection process is essential to ensure that the key powers and responsibilities are attributed to the right persons, that is, people possessing the skills and cognitive potential needed to be efficient in a crisis situation. Yet such a selection can only be undertaken if the proper incentives are put in place to attract and retain sufficient talented and trained staff.

³ The procedure to adopt is directly linked to the observed symptoms by a trained rule.
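The idea of a trained rule that links observed symptoms directly to a procedure, bypassing the orientation and decision stages, can be sketched in a few lines of Python. This is only an illustration of the MU7/MU9 example given in the text; the type, field names and returned strings are hypothetical, and any situation not covered by a trained rule falls back to the full OODA loop.

```python
from typing import NamedTuple


class Symptoms(NamedTuple):
    """Observed state of the two monitoring units (hypothetical fields)."""
    mu9_level_one_alarm: bool
    mu7_level_one_alarm: bool


FULL_OODA_LOOP = "no trained rule: go through orientation and decision"
TEST_MU7 = "launch the testing procedure on MU7"


def trained_response(symptoms: Symptoms) -> str:
    """Map observed symptoms directly to a procedure (rule-based level)."""
    # Rule taken from the text: a level one flood alarm given by MU9 alone
    # leads directly to a test of MU7.
    if symptoms.mu9_level_one_alarm and not symptoms.mu7_level_one_alarm:
        return TEST_MU7
    # Everything else is outside this illustrative rule set and needs
    # slower, knowledge-based reasoning.
    return FULL_OODA_LOOP


print(trained_response(Symptoms(mu9_level_one_alarm=True, mu7_level_one_alarm=False)))
print(trained_response(Symptoms(mu9_level_one_alarm=True, mu7_level_one_alarm=True)))
```

The sketch makes the trade-off visible: a rule table gives a fast response, but only for the situations that were anticipated, which is why training for unexpected situations remains necessary.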


V. Conclusion

A qualitative assessment of the reliability of the Lake Sarez Risk Mitigation Project (LSRMP) and its interactive management has been conducted on the basis of expert judgment. As a result, several potential improvement areas have been brought to light and recommendations issued. Such recommendations include the need to strive towards High Reliability Organization (HRO) standards through a safety oriented organizational culture, improved communication and decision making migration; the need to improve interactive crisis management skills such as maintaining awareness, sensemaking, empathy, improvisation, mental simulation and coordination capabilities; and the key importance of training, in normal, abnormal and extraordinary conditions.

The goal of a qualitative assessment was deliberately limited to bringing these potential improvement areas to light, yet a quantitative analysis of the situation would be the logical next step in the documentation of the reliability of the LSRMP in a crisis situation. However, more theoretical and methodological research would then be required, notably on a way to quantify the effect of cognitive and environmental factors (such as surprise and night time) on the evacuation time of the villages.

Furthermore, it is evident that the implementation of all the suggested measures will only effectively take place if the proper framework of incentives and means is provided. Indeed, incentive is needed at the operators’ level to prevent the exodus of skilled staff and thus ensure the durability of the system through proper maintenance, as well as sustained cognitive skills and local knowledge. Furthermore, incentive is needed at the management level to effectively promote a safety oriented organizational culture and lead the organization towards an HRO. Finally, incentive is needed at the government level to provide the project with sufficient means to achieve its goals, once international funding has ceased. Although a variety of means can be thought of to create incentives based on human wants and needs, including regulation and peer recognition, the incentives that have been recognized to be most efficient (Bea 2008) in the implementation of reliability measures in a long-term perspective are positive (i.e. rewards rather than punishment) and financial, and often rooted in a societal increase of safety and reliability requirements. Therefore the most efficient incentives to provide long-term funding sustainability to the LSRMP are likely to be found in the realms of international commerce and politics, and public opinion. Thus, a proper financial, social and economic study of the long-term sustainability of this endeavor is yet to be undertaken.

In conclusion, the main principle to bear in mind at the end of this paper is the importance of not relying on the technology alone when managing a risk mitigating system. Technology can fail, especially in the given context of hard environmental conditions and decreasing maintenance quality. As a matter of fact, technology will fail; and by failing will create a crisis if no robust measure is taken. Moreover, the engineered alarm system shows in the present case a very limited excess capacity of a few minutes at most, whereas the first signs of the phenomenon (i.e. the earthquake) would occur, and could thus be interpreted, with a time margin of several hours. Therefore, although the engineered early warning system is here clearly necessary to ensure the warning of the entire valley with an acceptable probability in the occurrence of a flash flood, the potential effect of the grass-roots capacity building of the communities through intensive training and education, based on their extensive knowledge of the location, must not be neglected and may result in a complementary and much more efficient means to mitigate the risk in the Bartang valley.

Finally and more fundamentally, it is most important to acknowledge the impossibility of entirely controlling the risk linked to a natural phenomenon of the magnitude of Lake Sarez; and perhaps someday, in the dreaded event of a catastrophe aftermath, to have the humility to consider the perhaps only reasonable reactive approach to a near miss, given the amplitude of the consequences: withdrawing, migrating and resettling, and ultimately leaving to Mother Nature the conclusion of the story.

VI. Acknowledgement

The author is indebted to the consulted experts, Mr. Patrice Droz (Stucky Ltd, Switzerland) and Mr. Rahim Balsara (FOCUS Humanitarian, Tajikistan) for their priceless contribution. Furthermore, Ms Michèle Itten, Mr. Mustafa Karim, Dr Robert Bea and Mr. Rune Storesund are acknowledged for their precious advice.



VII. References

[1] Aboelata M et al, “Transportation model for evacuation in estimating dam failure life loss”, Proceedings of the Australian Committee on Large Dams Conference, 2004

[2] Balsara R, Personal Correspondence, April 2008

[3] Bea R, “Human and Organizational Factors: Quality and Reliability of Engineered Systems”, Course Reader, 2008

[4] Bea R, “Managing the Unpredictable”, ASME feature article, 2008

[5] Colonna G, “Introduction to Employee Fire and Life Safety”, National Fire Protection Association, 2001

[6] Droz P, “Abstracts Related to Risk Mitigation of the LSRMP Final Report”, Stucky Ltd, 2007

[7] Droz P, Spasic-Gril L, “Lake Sarez Risk Mitigation Project: A Global Risk Analysis”, ICOLD, 2006

[8] Droz P, Personal Correspondence, February - April 2008

[9] Duclos P et al, “Community evacuation following a chlorine release, Mississippi”, Disasters 11 (4), 1987

[10] El-Hanbali U, “Implementation Completion and Results Report […] for Lake Sarez Risk Mitigation Project”, World Bank, 2007

[11] “Flood on Stava valley”, Seconds from Disaster, National Geographic Channel, 2008

[12] Gayosov A, Personal Correspondence, April 2008

[13] Genevois R, Ghirotti M, “The 1963 Vaiont Landslide”, Giornale di Geologia Applicata, 2005

[14] Graham W, “A Simple Procedure for Estimating Loss of Life from Dam Failure”, Dam Safety Office, US Bureau of Reclamation, 2001

[15] Karim M, Personal Correspondence, April 2008

[16] Merry M, “Assessing the Safety Culture of an Organization”, J. Safety and Reliability Society, 1998

[17] Life Safety Code, National Fire Protection Association, 2001

[18] Palmieri A, “Project Appraisal Document […] for Lake Sarez Risk Mitigation Project”, World Bank, 2000


[19] Palmieri A, “UN/IDNDR Interagency Risk Assessment Mission to Lake Sarez”, UN-OCHA, 1999

[20] Papyrin L, “Myths on Lake Sarez risk mitigation and realities”, Ferghana information agency, 2007

[21] Risley et al, “Usoi Dam Wave Overtopping and Flood Routing in the Bartang and Panj Rivers, Tajikistan”, USAid Water Resource Investigation Report, 2006

[22] Roberts K, “New Challenges in Organizational Research: High Reliability Organizations”, Industrial Crisis Quarterly, 1989

[23] Roberts K and Libuser C, “From Bhopal to Banking, Organizational Design Can Mitigate Risk”, Organizational Dynamics, 1993

[24] Roberts K, “Some Characteristics of High Reliability Organizations”, Organization Science, 1990

[25] Schuster R, Alford D, “Usoi Landslide Dam and Lake Sarez, Pamir Mountains, Tajikistan”, Environmental Engineering and Geoscience, 2004

[26] Sime J, “Crowd Psychology and Engineering”, Safety Science, 1995

[27] “Tajikistan Lake Sarez”, Fela Planungs AG Website, 2004 (http://www.fela.ch) (02/25/08)

[28] Urbanik II T, “Evacuation time estimates for nuclear power plants”, Journal of Hazardous Materials, 2000

[29] Weick K, “Sensemaking in Organizations”, Thousand Oaks, CA: Sage, 1995

[30] Weick K et al, “Organizing for High Reliability: Processes of Collective Mindfulness”, Research in Organizational Behavior, 1998.
