Chapter 7 Challenges of Data Management in Always-On Enterprise Information Systems

Mladen Varga University of Zagreb, Croatia

Abstract

Data management in always-on enterprise information systems is an important function that must be governed, that is, planned, supervised, and controlled. According to the Data Management Association, data management is the development, execution, and supervision of plans, policies, programs, and practices that control, protect, deliver, and enhance the value of data and information assets. The challenges of successful data management are numerous and vary from technological to conceptual and managerial. The purpose of this chapter is to consider some of the most challenging aspects of data management, whether they are classified as data continuity aspects (e.g., data availability, data protection, data integrity, data security), data improvement aspects (e.g., coping with data overload, data integration, data quality, data ownership/stewardship, data privacy, data visualization) or data management aspects (e.g., data governance), and to consider the means of taking care of them.

Introduction

In everyday business we may notice two important data characteristics:

• Every business is an information business: All business processes and important events are registered by data and, eventually, stored in an enterprise's data base. In other words: if it is not registered by data, it has not happened.
• Data is registered in digital form: The majority of important business data is registered in digital form, e.g. the sales data collected at the point of sale, the transaction data at the automated teller machine, etc. They are all memorized in digital form in various data bases.

The underlying task of an enterprise information system (EIS) is to link processes on the operational,

DOI: 10.4018/978-1-60566-723-2.ch007

Copyright © 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

management and decision-making level so as to improve performance efficiency, support good quality management and increase decision-making reliability (Brumec, 1997). The EIS, specifically its organization and functionality, plays a critical role in the effective use of data and information assets in an organization.

Data management involves the activities linked with the handling of all the organization's data as an information resource. The Data Management Association (DAMA) Data Management Body of Knowledge (DAMA, 2008) defines data management as "the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets."

An always-on EIS supports business agility, which is the ability to make quick decisions and take actions. "Agility is the ability of an organization to sense environmental change and respond efficiently and effectively to that change" (Gartner, 2006, p. 2). The measurable features that enable a business system to increase the agility of its performance can be defined as follows:

• Awareness is knowing what is going on: The awareness level can be determined by answering these questions: Do end users see the right information at the right time? Is the information easily accessible to the right people?
• Flexibility is the ability to respond appropriately to expected changes in business conditions.
• Adaptability is the ability to respond appropriately to unexpected change: Whether the structure of business data promotes or prevents flexibility and adaptability is the key question here.
• Productivity is the ability to operate effectively and efficiently: It is important to establish whether or not business data increase the efficiency and effectiveness of business operations and decisions.

Effective data management in an EIS is essential to fulfil these tasks. This chapter considers various aspects, possibly not all, of data management that seem to be important for running an

Table 1. Data management aspects and challenges

Aspect            Challenge
Data continuity   Data availability: Is data available all the time?
                  Data integrity: Is data integrity compromised?
                  Data security: Is data secure enough?
Data improvement  Data overload: Can we cope with all that data?
                  Data integration: How to integrate data? Could the data integrate the business? Do we know the organization's data? Do we know how to use the organization's data?
                  Data quality: Do we know what quality data is? Are we aware that the quality of data is degradable?
                  Data ownership/stewardship: Who is the owner/steward of the data?
                  Data privacy: Is data privacy a concern?
                  Data visualization: Could we amplify cognition of data?
Data management   Data governance: Do we need data governance?


EIS. Table 1 shows the considered aspects and challenges, classified as data continuity aspects, data improvement aspects, and data management aspects.

The challenges of successful data management vary from technological to conceptual. Technological aspects of data management help business continuity by making EIS data available, complete, consistent, correct and secure. The aspects addressing the problem of EIS data continuity are described in the section on data continuity.

Once the technological aspects are solved, business may run smoothly. Nevertheless, the business may run even better if it is innovated constantly and persistently. Conceptual aspects deal with important data management issues that can improve business data: coping with data overload, solving data integration problems, improving data quality, assigning data ownership/stewardship, maintaining data privacy, and improving insight into data. In fact, these aspects address the issue of EIS data improvement and are dealt with in the section on data improvement.

Data Continuity

Data Availability

Challenge: Is Data Available All the Time?

Enterprises use their EIS, based on information technology (IT) infrastructure, to increase process productivity, empower users to make faster and more informed decisions, and consequently provide a competitive advantage. As a result, the users are highly dependent on the EIS. If a critical EIS application and its data become unavailable, the entire business can be threatened by the loss of customers and revenue. Building an EIS that ensures high data availability is critical to the success of enterprises in today's economy.

Data availability is the degree of availability of data upon demand through an application, service or other functionality. The importance of data availability varies among applications. It is more important if an organization depends on data to provide business service to its customers. Data availability is always measured at the applications' end and by end users. When data is unavailable, the users experience the downtime of their applications. Downtime leads to lost business productivity and lost revenue, and it ruins the customer base. The total cost of downtime is not easy to determine. Direct costs, such as lost revenue and penalties when service level agreement (SLA) objectives are not met, can be quantified, but the effects of bad publicity and customer dissatisfaction are hard to determine.

Setting up a data availability strategy may help to define the aims and procedures to achieve the required high data availability. The steps of a data availability strategy involve determining data availability requirements, and planning, designing and implementing the required data availability. Data availability requirements analysis begins with a rigorous business impact analysis that identifies the critical business processes in the organization. Processes may be classified into a few classes:

• Processes with the most stringent high data availability requirements, with recovery time objective (RTO) and recovery point objective (RPO) close to zero, with the system supporting them on a continuous basis, such as internet banking. RTO (Oracle, 2008; IBM, 2008) is the maximum amount of time that a business process can be down before the organization begins suffering unacceptable consequences, such as financial loss. RPO is the maximum amount of data a business process may lose before severe harm to the organization results. RPO is a measure of the data-loss tolerance of a business process. It is measured in terms of time, for example one hour.


• Processes with relaxed data availability and RTO and RPO requirements, where the system does not need to support extremely high data availability, such as the supply chain.
• Processes that do not have rigorous data availability requirements, such as an organization's internal business processes.

The challenge in designing a highly available IT infrastructure is to determine all causes of application downtime. Causes of downtime may be planned, such as system or database changes, adding or removing a processor or node in a cluster, changing the hardware configuration, upgrading or patching software, and planned changes of the logical or physical data structure; or unplanned, such as computer and storage failures, corruption in the database, human errors, etc.

Important factors that influence data availability are:

• Reliability of the hardware and software components of the EIS: hw/sw reliability
• Recoverability from failure if it occurs: data protection
• Timely detection of availability problems: problem detection

Hw/Sw Reliability

Storage management is very important because data resides on storage media or storage devices. Disk drives and various disk subsystems, such as RAID (Redundant Array of Independent Disks) or JBOD (Just a Bunch Of Disks), are the dominant storage media. Although disk drives are very reliable (Mean Time Before Failure, MTBF, is almost one million hours), their mechanical nature makes them the most vulnerable part of a computer system. If a storage system includes hundreds or thousands of disk drives, the data availability problem becomes very severe. Secondary storage media for backup and archive purposes include removable storage, non-volatile storage, solid state disks, and tape devices.

Modern IT architecture, referred to as cluster or grid computing, has the potential to achieve data availability. Cluster/grid computing architecture effectively pools large numbers of servers and storage devices into a flexible computing resource. Low-cost blade servers, small and inexpensive multiprocessor servers, modular storage technologies, and open source operating systems such as Linux are the raw components of this architecture.

High hw/sw reliability may be achieved by using many features. To give some indication of them, and not intending to diminish the features of other vendors, here is a non-exhaustive list of Oracle's features (Oracle, 2008):

• Efficient storage management, such as Oracle's Automatic Storage Management: The feature spreads database files across all available storage and simplifies database file management.
• Redundant storage with a high-availability and disaster-recovery solution that provides fast failover, such as Oracle's Data Guard: The feature maintains standby databases as transactionally consistent copies of the primary or production database. When the primary database becomes unavailable, the feature switches a standby database to the primary status, minimizing the downtime. It may be used with traditional backup and restore.
• Fine-grained replication, such as Oracle's Streams: The feature includes fine-grained replication, many-to-one replication, data transformation, etc.
• Cluster or grid management, such as Oracle's Real Application Clusters and Clusterware: The features allow the database to run any application across a set of clustered servers. This provides a high level of availability and scalability. In the case that a server fails, the database continues to


run on the surviving servers. Also, a new server can be added online without interrupting database operations.

Data Protection

The main aim of data protection is to recover data from failure if it occurs. High data availability and data vitality can only be ensured using a comprehensive and reliable database backup and recovery strategy, and the means to locate media loss or corruption. Again, some of Oracle's (Oracle, 2008) features essential to data protection are:

• Backup and recovery management, such as Oracle's Secure Backup and Recovery Manager features: The backup feature must include all modern techniques of local or remote tape device backup, either on calendar-based scheduling or on demand. Data may be encrypted as well. Database recovery is a very demanding job, and a good recovery management tool is absolutely necessary. Such a tool may be designed to determine an efficient method of executing backup, restoration or recovery, for example online or offline backup of the whole database or its constituent parts (files, blocks), fast incremental backups (only changed blocks are backed up), incrementally updated backups (on-disk image copy backups are rolled forward in place using incremental backups), automatic recovery to a control point or point in time, etc.
• Recovery intelligence tools, such as Oracle's Data Recovery Advisor: A tool which is able to automatically diagnose disk data failures, present repair options and run them.
• Logical recovery features, such as Oracle's Flashback Technologies: The feature may analyze and recover data at the row and table level and do fine-granular repair, or rewind the entire database to undo extensive logical errors. It includes a fast database point-in-time "rewind" capability, historical viewing and quick recovery.

Continuous data protection (CDP), also called continuous backup or real-time backup, offers some advantages over traditional data protection. In CDP, data is continuously backed up by saving a copy of every data change, in contrast to traditional backup, where the database is backed up at discrete points in time. In CDP, when data is written to disk, it is also asynchronously written to a second (backup) location over the network, usually another computer. Essentially, CDP preserves a record of every transaction and captures every version of the data that the user saves. This eliminates the need for scheduled backups. It allows restoring data to any point in time, while traditional backup can only restore data to the point at which the backup was taken. Furthermore, CDP may use multiple methods for capturing continuous changes of data, providing fine granularity of restorable objects. In the growing market of CDP products the leaders are, among others, IBM Tivoli Continuous Data Protection for Files, Microsoft System Center Data Protection Manager, and Symantec Backup Exec.

In fact, CDP does complete journaling, which is essential for high availability (Schmidt, 2006). Journaling is a property of a database system where changes are first written to a non-volatile area before the database I/O operation has finished. Nevertheless, journaling may be restricted by aggressive caching techniques, where a write operation places changes in RAM and cache management eventually writes the data to disk. Thus, a DBMS that resides on a database server, as shown in Fig. 1 (Schmidt, 2006), employs persistent storage for the actual data (the data store), caching for the needed performance, and persistent storage for auxiliary data such as logs or journals.


Figure 1. DBMS and the types of storage
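The journaling idea described above — every change is first written to a non-volatile log before the data store is updated, so the store can be rebuilt after a failure — can be illustrated with a toy sketch. The class and its in-memory "storage" are invented for illustration and are not any vendor's implementation:

```python
import json

class JournaledStore:
    """Toy DBMS store: every change is appended to a journal before it
    is applied, so the store can be recovered by replaying the journal."""

    def __init__(self):
        self.journal = []   # append-only log (would be non-volatile)
        self.data = {}      # data store (would be cached and persisted)

    def write(self, key, value):
        # 1. Journal the change first (write-ahead).
        self.journal.append(json.dumps({"key": key, "value": value}))
        # 2. Only then apply it to the data store.
        self.data[key] = value

    def recover(self):
        # After a crash, rebuild the data store by replaying the journal.
        recovered = {}
        for entry in self.journal:
            change = json.loads(entry)
            recovered[change["key"]] = change["value"]
        return recovered

store = JournaledStore()
store.write("account:1", 100)
store.write("account:1", 80)
store.data.clear()                         # simulate losing the data store
assert store.recover() == {"account:1": 80}
```

Because every change is captured, the same record also supports the CDP-style restore to any point in time: replaying only a prefix of the journal reproduces the state as of an earlier moment.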

A relational database supports transactions as the basic units of work. A transaction consists of a series of low-level operations that are either all successfully executed or not executed at all. Transactions possess the ACID properties: Atomicity (a transaction is either done completely or not at all), Consistency (the database does not violate any integrity constraints when a transaction begins and when it ends), Isolation (every transaction behaves as if it accesses the database alone) and Durability (the successful transaction's results are not lost, but remain). In the high-availability environment, durability requires special attention. It may be reduced during disaster recovery.

Data Problem Detection

High data availability can only be achieved as a result of preventive action and early data problem detection. It is advisable to use one of the intelligent tools that automatically diagnose data failures, inform users about them, and assess the impact of the failure, such as Oracle's Data Recovery Advisor.

Data Integrity

Challenge: Is Data Integrity Compromised?

Protecting data integrity means ensuring that data is complete, consistent and correct on both the physical and the logical level. Physical integrity means physical protection from malicious, accidental or natural damage. Logical integrity is preserved if only the permitted combinations of data are registered in the data base, in keeping with the data description in the repository.

Data integrity is, in an always-on EIS environment, as important as data availability. Data integrity can be compromised in a number of ways. Storage media failure, including cache and controller failures, may cause corruption of data that can be more dangerous than a failure that causes storage to be offline. Corrupted data may be copied during a backup operation to backup media without being detected. Consequently, it is extremely important to check data integrity and detect data corruption before it is too late.

Human and logical errors are always possible. For example, a user or application may erroneously modify or delete data. In contrast to physical media corruption, logical errors are difficult to isolate because the database may continue to run without any alerts or errors. Focused repair techniques are thus necessary to combat logical errors. Repair technology may use the same features that are used in data protection. Again, some of Oracle's features (Oracle, 2008) are:


• An intelligent tool that automatically diagnoses data failures, presents recovery options, and executes recovery, such as Oracle's Data Recovery Advisor.
• Backup and recovery management, such as Oracle's Secure Backup and Recovery Manager features.
• Logical recovery features, such as Oracle's Flashback Technologies.
• An auditing tool, such as Oracle's LogMiner, which can be used to locate changes in the database, to analyse data in the database, and to provide an undo operation to roll back logical data corruption or user errors.

Data Security

Challenge: Is Data Secure Enough?

Data security must ensure that access to data is controlled and that data is kept safe from corruption. The broader and more commonly used term is information security, which is the process of protecting information and information systems from unauthorized access, use, disclosure, destruction, modification, disruption or distribution (Allen, 2001). Protection of confidential data and information is in many cases a legal issue (e.g. bank account data), an ethical issue (e.g. information about the personal behaviour of individuals), and a business issue (e.g. a business enterprise's customer data). For individuals, information security has a significant effect on privacy, which is described in the section on data privacy.

The key features of information security are confidentiality, integrity and availability. Confidentiality means that the system must not present data to an unauthorized user or system. Confidentiality is usually enforced by encrypting data during transmission, by limiting the number of places where data might appear, and by restricting access to the places where data is stored. Integrity means that data cannot be modified without authorization. Availability means that data must be available when it is needed. High-availability systems aim to remain available at all times, preventing service disruptions due to power outages, hardware failures, and system upgrades.

A Data Protection Act is used to ensure that personal data is accessible only to the individuals concerned. Data should be owned or stewarded so that it is clear whose responsibility it is to protect and control access to the data. In a security system where the user is given privileges to access and maintain an organization's data, the principle of least privilege should never be overlooked. This principle requires that a user, i.e. a user's programs or processes, is granted only the privileges necessary to perform tasks successfully. Organizations must regularly update their users' privileges database, especially when a user is moved to a new job or leaves the organization.

Due to the high probability that security threats will appear during the EIS lifecycle, it is absolutely necessary to establish a security policy, develop and implement security plans and programs, and manage security risks. Many security techniques are standardized. The International Organization for Standardization (ISO), the world's largest developer of standards, published the following standards: ISO-15443: Information technology - Security techniques - A framework for IT security assurance; ISO-17799: Information technology - Security techniques - Code of practice for information security management; ISO-20000: Information technology - Service management; and ISO-27001: Information technology - Security techniques - Information security management systems. Professional knowledge may be certified as Certified Information Systems Security Professional (CISSP), Information Systems Security Architecture Professional (ISSAP), Information Systems Security Engineering Professional (ISSEP), Information Systems Security Management Professional (ISSMP), and Certified Information Security Manager (CISM).
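The principle of least privilege described above can be sketched as a minimal role-based check. The roles and privilege names below are invented for illustration; a real system would keep such a privilege database in the DBMS or a directory service:

```python
# Each role carries only the privileges needed for its tasks.
ROLE_PRIVILEGES = {
    "clerk":   {"read_orders", "create_orders"},
    "auditor": {"read_orders", "read_ledger"},
    "dba":     {"read_orders", "create_orders", "read_ledger", "backup_db"},
}

def is_allowed(role: str, privilege: str) -> bool:
    """Grant an operation only if the role explicitly holds the privilege;
    anything not granted is denied by default."""
    return privilege in ROLE_PRIVILEGES.get(role, set())

# A clerk may create orders but must not read the ledger.
assert is_allowed("clerk", "create_orders")
assert not is_allowed("clerk", "read_ledger")

# When a user changes jobs, reassigning the role drops the old
# privileges at once -- the privileges database stays current.
```

The default-deny shape (an unknown role gets an empty privilege set) is the point: privileges are added per task, never subtracted from an all-powerful default.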


Data Improvement

Data Overload: Exploding Data

Challenge: Can We Cope with All that Data?

IDC (2008) gives a forecast of worldwide information growth through 2011. It estimates that the digital universe in 2007 numbered 2.25 × 10^21 bits, and that by 2011 it will be 10 times the size it was in 2006. Approximately 70% of the digital universe is created by individuals, but enterprises are responsible for the security, privacy, reliability, and compliance of 87% of the digital universe. The amount of data content is increasing to the point that we individually and collectively suffer from data overload.

Experiments and experience in decision-making processes show that data overload negatively impacts decision performance. Decision makers face three problems:

• the problem of timeliness: high volumes of data constrain them if they are performing both sense-making and decision-making tasks,
• the problem of throughput: high volumes of data overwhelm and distract them during sense-making tasks, causing "analysis paralysis" and lowering the overall performance, and
• the problem of focus: decision makers have a finite amount of attention that may be distributed across tasks.

Negative selectivity or filtering, which is commonly used, is an example of a technique to cope with data overload. By means of fusion, different sources of information are combined to improve the performance of a system (Hall & McMullen, 2004). The most obvious illustration of fusion is the use of various sensors, typically to detect a target. The different inputs may originate from a single sensor at different moments (fusion in time) or even from a single sensor at a given moment; in the latter case several experts process the input (fusion of experts).

As a response to the question "Can we ever escape from data overload?", Woods, Patterson & Roth (2002) suggest that the problem can be resolved by the cognitive activities involved in extracting meaning from data. They argue "that our situation seems paradoxical: more and more data is available, but our ability to interpret what is available has not increased" (p. 23). We are aware that having greater access to data is a great benefit, but the flood of data challenges our ability to find what is informative and meaningful.

Reduction or filtration of data does not seem to solve the problem of data overload. Data which is evaluated as unimportant and non-informative, and thus eliminated, may turn out to be critically important and informative in another particular situation. Ironically, if the benefit of technology is increased access to data, reduction of data discards some of the accessed data.

Data are informative by "relationship to other data, relationships to larger frames of reference, and relationships to the interests and expectations of the observer". "Making data meaningful always requires cognitive work to put the datum of interest into the context of related data and issues" (Woods, Patterson & Roth, 2002, p. 32). Therefore, all approaches to overcoming data overload must involve some kind of selectivity. Between positive and negative selectivity we must choose the positive. "Positive selectivity facilitates or enhances processing of a portion of the whole" (Woods, Patterson & Roth, 2002, p. 32). Negative selectivity or filtering, which is commonly used, inhibits processing of non-selected areas. "The critical criterion for processes of selection, parallel to human competence, is that observers need to remain sensitive to non-selected parts in order to shift focus fluently as circumstances change or to recover from missteps" (Woods, Patterson & Roth, 2002, p. 32). In conclusion, data overload can only be effectively resolved using a positive form


of selectivity and techniques that support focus shifting over the field of data. Consequently, in order to overcome data overload, data processing should create context sensitivity rather than insist on data finesse.

Figure 2. Data warehouse approach
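Returning to the data overload discussion: the contrast between negative filtering (discarding non-selected data) and positive selectivity (highlighting a portion while keeping the rest reachable for a focus shift) can be sketched as follows. The event data and relevance test are invented for illustration:

```python
def positive_select(records, is_relevant):
    """Partition rather than filter: highlight what matches the current
    context, but keep non-selected records available in the background
    so attention can shift when circumstances change."""
    highlighted = [r for r in records if is_relevant(r)]
    background = [r for r in records if not is_relevant(r)]
    return highlighted, background

events = ["disk failure", "login ok", "backup done", "disk latency high"]
focus, rest = positive_select(events, lambda e: "disk" in e)
assert focus == ["disk failure", "disk latency high"]
assert rest == ["login ok", "backup done"]   # retained, not discarded
```

A negative filter would return only `focus` and throw `rest` away; the positive form keeps the observer "sensitive to non-selected parts", at the cost of storing everything.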

Data Integration

In the eighties of the 20th century we all hoped for a single integrated enterprise database that would be based on a stable schema. These hopes have never been realised in practice and probably never will be. Data integration techniques are gaining increased importance as a result.

Data integration, in a broad sense, is a process of integrating mutually related data which resides at autonomous and heterogeneous sources, and providing a unified view of the integrated data through a unified global schema (Halevy, 2001). The need for data integration increases with the overall need to share existing data. The exploding abundance of data and new business needs require the data to be integrated in order to extract new information and gain new knowledge. Data integration aims to answer complex questions, such as how some market factors influence the marketing strategy of certain products. From the managerial view, data integration is known as Enterprise Information Integration.

Challenge: How to Integrate Data?

Although the data integration problem has been extensively considered theoretically, both in business and in science, many problems still remain to be solved in practice. Nonetheless, we may recognize three basic types of data integration approach: the federation approach, the warehouse approach and the mediator approach.

In the federation approach, each of the data sources talks independently through a wrapper to the other data sources in order to be mutually "integrated".

The data warehouse approach, which is frequently and successfully used in many commercial systems oriented towards analysis and decision support, is shown in Fig. 2. Information is extracted from source databases A, B and C to be transformed and loaded into the data warehouse D by the process known as ETL. Each of the data sources A, B and C has its own unique schema. After extracting data from A, B and C, the data may be transformed according to business needs, loaded into D and queried with a single schema. The data from A, B and C are tightly coupled in D, which is ideal for querying purposes.

Nevertheless, the freshness of data in D constitutes a problem, as the propagation of updated data from the original data source to D takes some time, called latency time. As business becomes more real-time, the system that supports it needs to be more real-time. The challenge is how to make a real-time data warehouse. Here are some techniques for making a data warehouse more or less real-time (Langseth, 2004):

• "Near real-time" ETL: The oldest and easiest approach is to execute the ETL process again. For some applications, increasing the frequency of the existing data load may be sufficient.
• Direct trickle feed: This is a true real-time data warehouse where the data warehouse is continuously fed with new data from the data sources. This is done either by directly


inserting or updating fact tables in the data warehouse, or by inserting data into separate fact tables in a real-time partition.
• Trickle and feed: The data is continuously fed into staging tables and not directly into the actual data warehouse tables. Staging tables are in the same format as the actual data warehouse tables. They contain either a copy of all the data or a copy of the data of the current period of time, such as a day. At a given moment the staging table is swapped with the actual table, so the data warehouse becomes instantly up-to-date. This may be done by changing the view definition so that the updated table is used instead of the old one.
• External real-time cache: This is a variant of trickle and feed where the real-time data are stored outside the data warehouse in an external real-time cache, avoiding potential performance problems and leaving the data warehouse largely intact.

Full integration of the data in the business sense is achieved by the warehouse approach supported by an active data warehouse. The active data warehouse represents a single, canonical state of the business, i.e. a single version of the truth. It represents a closed-loop process between the transactional (operational) system and the data warehousing (analytical) system. The transactional system feeds the data warehouse, and the data warehouse feeds back the transactional system in order to drive and optimize transactional processing. Thus, the data warehouse is active if it automatically delivers information to the transactional system.

Recent trends towards the mediator approach try to loosen the coupling between the various data sources and thus to avoid the problem of replicated data and delayed updates in the data warehouse architecture. This approach (Fig. 3) uses a virtual database with a mediated schema and wrappers, i.e. adapters, which translate incoming queries and outgoing answers. A wrapper wraps (Ullman, 1997) an information source and models the source using a source schema. The situation emerges when two independent data sources have to be merged or combined into one source, such as the integration of two similar databases when two organizations merge, or the integration of two document bases with similar structure and content.

The users of the integrated system, i.e. data source D, are separated from the details of the data sources A, B and C at the schema level by specifying a mediated or global schema. The mediated schema is a reconciled view of the data in sources A, B and C, so the user needs to understand the structure and semantics of the mediated schema in order to make a meaningful query (Lu, 2006). The task of the data integration system is to isolate the user from the knowledge of where the data are, how they are structured at the sources, and how they are reconciled into the mediated schema.
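The staging-table swap of the trickle-and-feed technique listed earlier can be sketched with a toy in-memory model. The class, the "tables" as Python lists, and the sample rows are invented for illustration; a real warehouse would swap physical tables or redefine a view:

```python
class NearRealTimeWarehouse:
    """Toy model of trickle and feed: new rows go to a staging table of
    the same format; at a chosen moment the staging table is merged into
    the actual table, so queries instantly see an up-to-date warehouse."""

    def __init__(self):
        self.actual = []    # table that queries run against
        self.staging = []   # continuously fed, same format as actual

    def trickle(self, row):
        # Continuous feed is cheap: it never touches the queried table.
        self.staging.append(row)

    def swap(self):
        # Equivalent to repointing the view at the freshly fed table
        # (here: current-period rows are folded into the actual table).
        self.actual, self.staging = self.actual + self.staging, []

wh = NearRealTimeWarehouse()
wh.trickle({"sale_id": 1, "amount": 10})
wh.trickle({"sale_id": 2, "amount": 25})
assert wh.actual == []       # queries do not yet see the staged rows
wh.swap()
assert len(wh.actual) == 2   # warehouse instantly up to date
```

The design choice is the trade-off the chapter describes: between swaps, queries are fast and undisturbed but see slightly stale data; the swap itself makes the warehouse current in one step.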

Figure 3. Mediator approach
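The mediator architecture of Figure 3 — a mediated schema with a per-source wrapper translating between the local schema and the global one — might be sketched as follows. The source schemas, field names and wrapper functions are invented for illustration, and a real mediator would translate and push down queries rather than materialize all sources:

```python
# Two autonomous sources with different local schemas.
SOURCE_A = [{"cust": "Ana", "tel": "123"}]        # source A's schema
SOURCE_B = [{"name": "Ivo", "phone_no": "456"}]   # source B's schema

def wrapper_a(rows):
    # Wrapper: translate source A's schema into the mediated schema.
    return [{"customer": r["cust"], "phone": r["tel"]} for r in rows]

def wrapper_b(rows):
    # Wrapper: translate source B's schema into the mediated schema.
    return [{"customer": r["name"], "phone": r["phone_no"]} for r in rows]

def mediator_query(predicate):
    """Answer a query posed over the mediated schema; the user never
    sees where the data lives or how it is structured at the sources."""
    unified = wrapper_a(SOURCE_A) + wrapper_b(SOURCE_B)
    return [r for r in unified if predicate(r)]

result = mediator_query(lambda r: r["customer"] == "Ivo")
assert result == [{"customer": "Ivo", "phone": "456"}]
```

The user writes the predicate against the mediated schema (`customer`, `phone`) only; adding a source means writing one new wrapper, not changing user queries.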


The source databases are approachable through wrapper code which transforms the original query into a specialized query over the original databases A, B or C. This is a view-based query, because each of the data sources A, B and C can be considered to be a view over the virtual database D. The first step in the approach involves setting up the mediated schema, either manually or automatically (Rahm & Bernstein, 2001). The next step involves specifying its relationships to the local schemas in either Global As View (GAV) or Local As View (LAV) fashion.

In the GAV or global-centric approach, the mediated or global schema is expressed in terms of the data sources, i.e. with each data element of the mediated schema, a view over the data sources must be associated. In the GAV approach the mediator processes queries into steps executed at the sources. Changes in the data sources require revising the global schema and the mapping between the global schema and the source schemas. Thus, GAV is sensitive with respect to the scalability of the global model.

In the LAV or source-centric approach, the mediated or global schema is specified independently from the sources, but the sources are defined as views over the mediated schema. LAV is better at scalability because changes in the data sources require adjusting only the description of the source view. Nevertheless, query processing in LAV is more difficult. Some approaches try to combine the best of the GAV and LAV approaches (Xu & Embley, 2004; Cali, De Giacomo & Lenzerini, 2001).

The query language approach advocates extending query languages with powerful constructs to facilitate integration without the creation of a mediated or global schema. SchemaSQL (Lakshmanan, Sadri & Subramanian, 2001) and UDM (Lu, 2006) are examples of this approach. They serve as a uniform query interface (UQI) which allows users to write queries with only partial knowledge of the "implied" global schema.

An issue of growing interest in data integration is the problem of eliminating information conflicts among the data sources being integrated. Intensional inconsistencies, often referred to as semantic inconsistencies, appear when the sources are in different data models, or have different data schemas, or the data is represented in different natural languages or in different measures. For example: "Is 35 degrees measured in Fahrenheit or in Centigrade?" This type of inconsistency may be resolved by the usage of ontologies, which explicitly define schema terms and thus help to resolve semantic conflicts. This is also called ontology-based data integration.

Extensional inconsistencies, often referred to as data inconsistencies, are factual discrepancies among the data sources in data values that belong to the same objects. Extensional inconsistencies are visible only after intensional inconsistencies are resolved. Motro & Anokhin (2006, p. 177) argue that "all data are not equal" and "that the data environment is not egalitarian, with each information source having the same qualification". In a diverse environment, "the quality" of an information provider's data is not equal. Many questions arise, such as: Is the data recent enough, or is it outdated? Is the data source trustworthy? Is the data expensive? An interesting approach is suggested by Motro & Anokhin (2006). In the Internet era, when the number of alternative sources of information for most applications increases, users often evaluate information about the data sources. Metadata such as timestamp, cost, accuracy, availability and clearance, whether provided by the sources themselves or by a third party dedicated to the ranking of information sources, may help users judge the suitability of data from an individual data source.

Challenge: Could the Data Integrate the Business?

A business process is a structured, measured set of activities designed to produce a specific output for a particular customer or market (Davenport, 1993). Even though most organizations have a functional structure, examining of business pro-

119 Challenges of Data Management in Always-On Enterprise Information Systems

cesses provides a more authentic picture of the ERP is often supplemented by program way business is run. modules for analytical data processing which is A good information system relies on Enter- characteristic for data warehousing and decision prise Resource Planning system (ERP) which support systems. As a result it is perceived as an uses a multitude of interrelated program modules Enterprise Information system (EIS) or simply an that process data in individual functional areas. Enterprise System (ES) that represents complete In Figure 4, ERP is exemplified by three basic operational and analytical program solutions. functions: supply, manufacturing and sales. They According to business process approach, some stretch from the top to the bottom of the pyramid. typical process lines may be used: If a program module covers the whole function, it runs through all management levels: operational, • Financial and Business Performance tactical and strategic. Management: a set of processes that help

Figure 4. Enterprise Resource Planning (ERP) system

120 Challenges of Data Management in Always-On Enterprise Information Systems

organizations optimize their business or fi- example, only operational sales data exist, nancial performance whereas sales data for the tactical and stra- • Customer Relationship Management: a tegic level of decision-making are missing set of processes utilized to provide a high due to the lack of appropriate sales reports level of customer care or if there is no possibility to interactively • Supply Chain and Operations: a set of analyse sales data. processes involved in moving a product or • Insufficient horizontal integration of dif- service from supplier to customer ferent functional areas results when func- tional areas are not integrated. In case that A vertically integrated information system con- production data are not visible in sales nects activities on the lowest level of a particular application, or they need to be re-entered function (e.g. a process of transactional retailing) (although they already exist in production with data analysis and data representation on application), operational sales data will not the decision-making level (e.g. reports on sales be well integrated with operational produc- analysis for the general manager). tion data. A horizontally integrated information system • Finally, if data in an information system enables a systematic monitoring of specific busi- are not sufficiently integrated, the informa- ness process from end to end. For instance, as tion system will be comprised of isolated soon as a purchase order arrives, an integrated information (data) islands. An information information system can receive it, forward it “au- system is not integrated unless individual tomatically” to sales and delivery process which functional areas are mutually integrated will deliver the goods to the customer and send the with other functional areas by data. For in- invoice to his information system, open a credit stance, sales data may not be integrated i.e. 
account; record the quantity of goods delivered connected with production data; or they in warehouse evidence; in case that goods first may be integrated in indirect and compli- need to be produced in a manufacturing plant the cated ways. Problem occurs because some system will issue a work order to manufacture the older applications were designed for indi- required quantity of goods; the production process vidual functional areas without concern for can be supplied by a production plan; sales results other functional areas. will be visible to the sales manager and analysed using analytical tools of the information system; Challenge: Do We Know etc. An integrated information system enables Organization’s Data? recording of all business events so that the data can be used an analysed effectively throughout If we do not know organization’s data well enough, the organization. we will use them rarely and poorly. Users often Let us mention some problems related to do not know what information is available and, integrated data: as a consequence, they do not use the information system frequently enough. Users should be briefed • If the vertical coverage of data from a about the information system and the ways to use particular functional area is insufficient, it. A (central) data catalogue should be provided data does not comprise all levels of this to present the complete data repertoire. functional area. This is the case when, for
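The GAV-style mediation discussed earlier in this section can be sketched in a few lines. This is a minimal illustration, not the chapter's own implementation; the source names, fields and wrapper functions are hypothetical. Each global data element is defined as a view over the sources, and the mediator answers a global query by unfolding that mapping:

```python
# Minimal sketch of GAV (global-as-view) mediation. All names are illustrative.
# Hypothetical source extensions (real wrappers would fetch these from databases).
SOURCE_A = [{"cust_id": 1, "cust_name": "Alice"}]   # source A's local schema
SOURCE_B = [{"id": 2, "full_name": "Bob"}]          # source B's local schema

def wrapper_a():
    # Wrapper: rewrites source A tuples into the mediated (global) schema.
    return [{"id": r["cust_id"], "name": r["cust_name"]} for r in SOURCE_A]

def wrapper_b():
    return [{"id": r["id"], "name": r["full_name"]} for r in SOURCE_B]

# GAV mapping: each global relation is associated with views over the sources.
GLOBAL_SCHEMA = {"customer": [wrapper_a, wrapper_b]}

def query_global(relation, predicate=lambda row: True):
    # The mediator unfolds the mapping and unions the source views.
    rows = []
    for view in GLOBAL_SCHEMA[relation]:
        rows.extend(r for r in view() if predicate(r))
    return rows

print(query_global("customer"))
# Adding or changing a source means revising GLOBAL_SCHEMA itself --
# the GAV scalability drawback noted in the text. Under LAV, by contrast,
# only the new source's own view description would be written.
```

The design choice the sketch makes visible: in GAV the integration knowledge lives in the global mapping, so query answering is simple view unfolding, while source changes ripple into `GLOBAL_SCHEMA`.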

Challenge: Do We Know How to Use the Organization's Data?

If we do not know how to use the information system well enough, we will not use it sufficiently and we will not exploit all of its capabilities. Even if users know that the organization has data, they may not know how to use it. Therefore, the information system should be documented and the ways of using it described in a catalogue of functions.

Data Quality

Challenge: Do We Know What Quality Data Is?

Juran and Godfrey (1999) define data to be of high quality if the data is fit for its intended uses in operations, decision making and planning, i.e. if it correctly represents the real world to which it refers. Among the various categories of attributes describing data quality, the most commonly used are accuracy (the degree of conformity of the data to the actual or true value to which it refers), completeness (nothing needs to be added to it), relevance (the data is pertinent and adequate to the context and the person who uses it) and timeliness (the data is given in a timely manner). Naturally, it is desirable that the data supported by an EIS fulfil the mentioned quality characteristics.

Bocij, Chaffey, Greasley and Hickie (2006) give a few additional lists of categories of data quality concerning the time dimension (timeliness, currency, frequency, time period), the content dimension (accuracy, relevance, completeness, conciseness, scope), the form dimension (clarity, detail, order, presentation, media), etc. However, the end user requires quality data in quantities that support the decision-making process. Too much data can be a burden for the person who needs to make a decision.

Most companies view data quality control as a cost problem, but in reality it can drive revenue. The problem of data quality drives companies to set up a data governance function whose role it is to be responsible for data quality.

Challenge: Are We Aware that the Quality of Data is Degradable?

Many databases behave as growing organisms. Data is added, updated or deleted. Due to new applications the structure of a database may be altered to meet new business requirements. During the database lifetime many alterations keep the database functional. But over time, with continuing alterations to its structure and content, the logical integrity of the database can be so degraded that it becomes unreliable. It often happens that data integration is negatively affected by the lack of proper documentation regarding changes, by the fact that the people responsible for the application have moved on, or because the vendor of the application no longer exists. In many cases the organization does not realize that the database has been degraded until a major event, such as data integration, occurs. Then it is time to determine the true quality of the data in the database.

Healy (2005) suggests that the only reliable way to determine the true nature of data is through a data audit, involving thorough data analysis or profiling. There are manual and automated methods of data profiling. In manual data profiling, analysts assess data quality using the existing documentation and assumptions about the data to anticipate what data problems are likely to be encountered. Instead of evaluating all the data, they take data samples and analyze them. If such an analysis discovers wrong data, appropriate programs have to be written to correct the problems. The manual procedure can be a very costly and time-consuming process requiring a number of iterations. The more efficient and accurate way of examining data is to use an automated data profiling tool that is able to examine all the data. Data profiling tools can identify many problems resident in the data and provide a good picture of the data structure, including metadata descriptions, data values, formats, frequencies, and data patterns. Automated data profiling solutions make it possible to correct data errors and anomalies more easily and efficiently.

Data Ownership/Stewardship

Challenge: Who is the Owner/Steward of the Data?

Data ownership refers to both the possession of and responsibility for data. Ownership implies power over and control of data. Data control includes the ability to access, create, modify, package, sell or remove data, as well as the right to assign privileges to others. This definition probably implies that the database administrator is the "owner" of all enterprise data. Nevertheless, this is not true and must not be true. The right approach to ownership aims to find the user who will be responsible for the quality of the data.

According to Scofield (1998), telling a user that he/she "owns" some enterprise data is a dangerous thing. The user may exercise the "ownership" to inhibit the sharing of data around the enterprise. Thus, the term "data stewardship" is better, implying a broader responsibility where the user must consider the consequences of changing "his/her" data. Data stewardship may be broken into several stewardship roles. For example, definition stewardship is responsible for clearly defining the meaning of data and may be shared between analyst, database administrator and key user. Access stewardship is responsible for permitting or preventing access to the enterprise's data. Quality stewardship is responsible for fulfilling the broad term "quality data" and may be shared between analyst and key user. Thus, data stewardship may be seen as a distributed responsibility, shared between IT people (analyst, database administrator) and key users. The responsibilities assigned to all stewardship actors, i.e. data stewards, must be registered in a data catalogue.

Data Privacy

Challenge: Is Data Privacy a Concern?

Data privacy is important wherever personally identifiable information (in digital or other form) is collected and stored. Data privacy issues arise in various business domains, such as healthcare records, financial records and transactions, criminal justice records, residence records, ethnicity data, etc. Information that is sensitive and needs to be kept confidential includes, for example, information about a person's financial assets and financial transactions.

The challenge in data privacy is to share data while protecting personally identifiable information. Since freedom of information is promoted at the same time, promoting both data privacy and data openness might seem to be in conflict. The data privacy issue is especially challenging in the Internet environment. Search engines and data mining techniques can collect and combine personal data from various sources, thus revealing a great deal about an individual. The sharing of personally identifiable information about users is another concern.

Data protection acts regulate the collecting, storing, modifying and deleting of information which identifies living individuals, sometimes known as data subjects. This information is referred to as personal data. The legal protection of the right to privacy in general, and of data privacy in particular, varies greatly among countries. Data privacy rights are most strongly regulated in the European Union (EU), more restrictively than in the United States of America. Data protection in the EU is governed by Data Protection Directive 95/46/EC (EU, 1995), where personal data is defined as "any information relating to an identified or identifiable natural person". Personal data must be, according to the Directive, "(a) processed fairly and lawfully; (b) collected for specified, explicit and legitimate purposes and not further processed in a way incompatible with those purposes; (c) adequate, relevant and not excessive in relation to the purposes for which they are collected and/or further processed; (d) accurate and, where necessary, kept up to date; every reasonable step must be taken to ensure that data which are inaccurate or incomplete, having regard to the purposes for which they were collected or for which they are further processed, are erased or rectified; and (e) kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the data were collected or for which they are further processed."

Privacy and data protection are particularly challenging in multiple-party collaboration. If heterogeneous information systems with differing privacy rules are interconnected and information is shared, privacy policy rules must be mutually communicated and enforced. An example of a platform for communicating privacy policy in the web-services environment is Web Services Privacy (WS-Privacy). P3P (W3C, 2008) is a system for making Web site privacy policies machine-readable. P3P enables Web sites to translate their privacy practices into a standardized, machine-readable XML format that can be retrieved automatically and easily interpreted by a user's browser. The translation can be performed manually or with automated tools. Eventually, the Web site is able to automatically inform P3P user agents of the site's privacy practices in both machine- and human-readable formats. On the user side, a P3P user agent may read the site's P3P privacy policy and, if appropriate, automate decision-making based on this practice.

Data Visualization

Challenge: Could We Amplify Cognition of Data?

Many data acquisition devices, such as various measuring instruments, scanners, etc., generate large data sets, whose data is collected in large databases in textual, numerical or multimedia form. The results of the analysis of such large data sets are not acceptable and attractive to end users without good techniques for the interpretation and visualization of information.

Information visualization is a rapidly advancing field of study, both in academic research and in practical applications. Card, Mackinlay and Shneiderman (1999, p. 8) define information visualization as "the use of computer-supported, interactive, visual representations of abstract data to amplify cognition. Cognition is the acquisition or use of knowledge. The purpose of the information visualization is insight, not picture. The goal of the insight is discovery, decision making and explanation. Information visualization is useful to the extent that it increases our ability to perform this and other cognitive activities."

Information visualization exploits the exceptional ability of the human brain to effectively process visual representations. It aims to explore large amounts of abstract, usually numeric, data to derive new insights or simply make the stored data more interpretable. Thus, the purpose of data visualization is to communicate information clearly and effectively through graphical means, so that information can be easily interpreted and then used.

There are many specialized techniques designed to produce various kinds of visualization using graphics or animation. Various techniques are used for turning data into information by using the capacity of the human brain to visually recognize data patterns. Based on the scope of visualization, there are different approaches to data visualization. According to (VisualLiteracy, 2008) the visualization types can be structured into seven groups: sketches ("Sketches are atmospheric and help quickly visualize a concept. They present key features, support reasoning and arguing, and allow room for own interpretation. In business sketches can be used to draw on flip charts and to explain complex concepts to a client in a client meeting."), diagrams ("Diagramming is the precise, abstract and focused representation of numeric or non-numeric relationships at times using predefined graphic formats and/or categories. An example of a diagram is a Cartesian coordinate system, a management matrix, or a network. Diagrams explain causal relationships, reduce the complexity to key issues, structure, and display relationships."), images ("Images are representations that can visualize impression, expression or realism. An image can be a photograph, a computer rendering, a painting, or another format. Images catch the attention, inspire, address emotions, improve recall, and initiate discussions."), maps ("Maps represent individual elements (e.g., roads) in a global context (e.g., a city). Maps illustrate both overview and details, relationships among items; they structure information through spatial alignment and allow zoom-ins and easy access to information."), objects ("Objects exploit the third dimension and are haptic. They help attract recipients (e.g., a physical dinosaur in a science museum), support learning through constant presence, and allow the integration of digital interfaces."), interactive visualizations ("Interactive visualizations are computer-based visualizations that allow users to access, control, combine, and manipulate different types of information or media. Interactive visualizations help catch the attention of people, enable interactive collaboration across time and space and make it possible to represent and explore complex data, or to create new insights."), and stories ("Stories and mental images are imaginary and non-physical visualizations. Creating mental images happens through envisioning."). Each of the mentioned visualization types has its specific area of application.

Data Management

DAMA's functional framework (DAMA, 2008) suggests and guides management initiatives to implement and improve data management through ten functions:

• Data Governance: planning, supervision and control over data management and use
• Data Architecture Management: an integral part of the enterprise architecture
• Data Development: analysis, design, building, testing, deployment and maintenance
• Database Operations Management: support for structured physical data assets
• Data Security Management: ensuring privacy, confidentiality and appropriate access
• Reference & Master Data Management: managing versions and replicas
• Data Warehousing & Business Intelligence Management: enabling access to decision support data for reporting and analysis
• Document & Content Management: storing, protecting, indexing and enabling access to data found in unstructured sources (electronic files and physical records)
• Meta Data Management: integrating, controlling and delivering meta data
• Data Quality Management: defining, monitoring and improving data quality

Using DAMA's functional framework (DAMA, 2008), each function may be described by seven environment elements: the goal and principles of the function; the activities to be done in the function (i.e. planning, control, development and operational activities); the deliverables produced by the function; the roles and responsibilities in the function; the practices and procedures used to perform the function; the technology supporting the function; and organization and culture.

A similar term is enterprise information management. Gartner (2005) defines it as an organizational commitment to define, secure and improve the accuracy and integrity of information assets and to resolve semantic inconsistencies across all boundaries to support the technical, operational and business objectives of the company's enterprise architecture strategy.
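The automated data profiling described under Data Quality above can be sketched minimally. This is an illustration only, with hypothetical column names and sample records, not a real profiling product; it shows the kind of report (null counts, distinct counts, character patterns) that such tools build from a full scan of the data:

```python
# Minimal sketch of automated data profiling (illustrative data, not from the chapter).
from collections import Counter

def pattern(value):
    # Map each character to a pattern symbol: 9 = digit, A = letter, else kept as-is.
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in str(value))

def profile(rows):
    # Scan every value of every column and summarize nulls, distinct values
    # and character patterns -- a crude version of a profiling tool's report.
    report = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        report[col] = {
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
            "patterns": Counter(pattern(v) for v in non_null),
        }
    return report

rows = [
    {"id": "1001", "phone": "555-0101"},
    {"id": "1002", "phone": None},
    {"id": "10X3", "phone": "555-0102"},  # anomaly: a letter inside a numeric id
]
rep = profile(rows)
print(rep["id"]["patterns"])  # two distinct patterns expose the anomalous id
```

Because the profiler examines all values rather than a hand-picked sample, the unexpected second pattern for the `id` column surfaces automatically, which is exactly the advantage over manual sampling that the text describes.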


Challenge: Do We Need Data Governance?

According to DAMA (2008), the beginning and central data management activity is data governance, which is the exercise of authority, control and decision-making over the management of an organization's data. It is a high-level planning and control mechanism over data management. IBM (2007, p. 3) defines data governance as "a quality control discipline for adding new rigor and discipline to the process of managing, using, improving, monitoring and protecting organizational information". It may be a part of the organization's IT governance.

All organizations take care of data governance, whether formally or informally, and the tendency is to manage it more formally. The objectives of data governance are implemented by various data governance programs or initiatives. Factors driving data governance programs include the need to implement an EIS as a core system with a consistent enterprise-wide definition of data, and the problems of an organization's data overload, data integration, data security, etc. Some external drivers are regulatory requirements for more effective controls for information transparency, such as Basel II and Sarbanes-Oxley, and the need to compete in the market using better analytics and having the right information to provide reliable insights into the business.

Data governance is not a typical IT project, as was declared at the first Data Governance Conference in 2006. Based on the experience of successful data governance programs, it is agreed that data governance is more than 80 percent communication, and that data governance must focus on people, process and communication before considering any technology implementation. An organization's data governance body may consist of executive leadership, line-of-business and project managers, and data stewards.

Data governance must follow a strategic approach: considering first the current state, devising a future state and designing an implementation plan to achieve that future state. A data governance project may be difficult, often with the tough task of changing the culture of an organization. Therefore, a phased approach should be applied. Each step must be small enough to be implemented successfully and relatively quickly, the achieved results analysed, and the lessons learned applied in the following steps.

IBM (2007) describes a data governance maturity model similar to the well-known Software Engineering Institute Capability Maturity Model for an organization's software development process, which describes a five-level graduated path from initial (level 1), managed (level 2) and defined (level 3) through quantitatively managed (level 4) to optimizing (level 5). The model measures the data governance competencies of an enterprise based on 11 domains of data governance maturity: organizational structures and awareness; stewardship; policy; value creation; data risk management and compliance; information security and privacy; data architecture; data quality management; classification and metadata; information lifecycle management; and audit information, logging and reporting.

Conclusion

The challenges of successful data management are numerous and diverse, varying from technological to conceptual. The chapter presented the most challenging aspects of data management, classified into three classes. The first combines data availability, data integrity and data security, which serve as data continuity aspects that are important for the continuous provision of data in business processes and for decision-making purposes. The aspects in the second class enable innovative, better, more efficient and more effective data usage. The problems of data overload, data integration, data quality, data degradation, data ownership or stewardship, data privacy, and data visualization are described. The last aspect is of a managerial nature. Data governance is important for planning, supervising and controlling all management activities exercised to improve organizational data and information.

It has been shown that data management challenges are numerous. A data management analysis in a typical organization will most probably reveal that some data management aspects are more problematic than others, and resolving data management issues can be a dynamic task. Consequently, data governance will constantly need to discover novel and innovative ways to deal with data management problems.

References

W3C. (2008). Platform for privacy preferences (P3P) project. Retrieved on September 2, 2008, from www.w3.org/P3P/

Allen, J. H. (2001). The CERT guide to system and network security practices. Boston: Addison-Wesley.

Bocij, P., Chaffey, D., Greasley, A., & Hickie, S. (2006). Business information systems. Harlow, UK: Prentice Hall Financial Times.

Brumec, J. (1997). A contribution to IS general taxonomy. Zbornik radova, 21(22), 1-14.

Cali, A., De Giacomo, G., & Lenzerini, M. (2001). Models for information integration: Turning local-as-view into global-as-view. In Proc. of Int. Workshop on Foundations of Models for Information Integration, 10th Workshop in the Series Foundations of Models and Languages for Data and Objects. Retrieved on August 16, 2008, from www.dis.uniroma1.it/~degiacom/papers/2001/CaDL01fmii.ps.gz

Card, S. K., Mackinlay, J. D., & Shneiderman, B. (1999). Readings in information visualization: Using vision to think. San Francisco: Morgan Kaufmann.

DAMA. (2008). DAMA-DMBOK: Functional framework, version 3. Retrieved on August 16, 2008, from http://www.dama.org/files/public/DMBOK/DI_DAMA_DMBOK_en_v3.pdf

Davenport, T. H. (1993). Process innovation: Reengineering work through information technology. Boston: Harvard Business School Press.

EU. (1995). Directive 95/46/EC of the European Parliament and the council on the protection of individuals with regard to the processing of personal data and on the free movement of such data. Retrieved on September 2, 2008, from http://ec.europa.eu/justice_home/fsj/privacy/docs/95-46-ce/dir1995-46_part1_en.pdf

Gartner. (2005). Business drivers and issues in enterprise information management. Retrieved on September 2, 2008, from http://www.avanade.com/_uploaded/pdf/avanadearticle4124441.pdf

Gartner Inc. (2006). Defining, cultivating, and measuring enterprise agility. Retrieved on September 15, 2008, from http://www.gartner.com/resources/139700/139734/defining_cultivating_and_mea_139734.pdf

Halevy, A. Y. (2001). Answering queries using views: A survey. The VLDB Journal, 10(4), 270–294. doi:10.1007/s007780100054

Hall, D., & McMullen, S. H. (2004). Mathematical techniques in multisensor data fusion. Norwood, USA: Artech House.

Healy, M. (2005). Enterprise data at risk: The 5 danger signs of data integration disaster (White paper). Pittsburgh, PA: Innovative Systems. Retrieved on August 22, 2008, from http://www.dmreview.com/white_papers/2230223-1.html

IBM. (2007). The IBM data governance council maturity model: Building a roadmap for effective data governance. Retrieved on August 22, 2008, from ftp://ftp.software.ibm.com/software/tivoli/whitepapers/LO11960-USEN-00_10.12.pdf


IBM. (2008). RPO/RTO defined. Retrieved on September 15, 2008, from http://www.ibmsystemsmag.com/mainframe/julyaugust07/ittoday/16497p1.aspx

IDS. (2008). IDS white paper: The diverse and exploding digital universe. Framingham: IDC.

Juran, J. M., & Godfrey, A. B. (1999). Juran's quality handbook. McGraw-Hill.

Lakshmanan, L. V., Sadri, F., & Subramanian, S. N. (2001). SchemaSQL: An extension to SQL for multidatabase interoperability. ACM Transactions on Database Systems, 26(4), 476–519. doi:10.1145/503099.503102

Langseth, J. (2004). Real-time data warehousing: Challenges and solutions. Retrieved on August 29, 2008, from http://DSSResources.com/papers/features/langseth/langseth02082004.html

Lu, J. J. (2006). A data model for data integration. (ENTCS 150, pp. 3–19). Retrieved on August 16, 2008, from http://www.sciencedirect.com

Motro, A., & Anokhin, P. (2006). Fusionplex: Resolution of data inconsistencies in the integration of heterogeneous information sources. Information Fusion, 7, 176–196. doi:10.1016/j.inffus.2004.10.001

Oracle. (2008). Oracle® database high availability overview 11g release 1 (11.1). Retrieved on September 8, 2008, from http://download.oracle.com/docs/cd/B28359_01/server.111/b28281/overview.htm#i1006492

Rahm, E., & Bernstein, P. A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10(4), 334–350. doi:10.1007/s007780100057

Schmidt, K. (2006). High availability and disaster recovery. Berlin: Springer.

Scofield, M. (1998). Issues of data ownership. DM Review Magazine. Retrieved on August 29, 2008, from http://www.dmreview.com/issues/19981101/296-1.html

Ullman, J. D. (1997). Information integration using local views. In F. N. Afrati & P. Kolaitis (Eds.), Proc. of the 6th Int. Conf. on Database Theory (ICDT'97) (LNCS 1186, pp. 19-40). Delphi.

VisualLiteracy. (2008). Visual literacy: An e-learning tutorial on visualization for communication, engineering, and business. Retrieved on September 3, 2008, from http://www.visual-literacy.org/

Woods, D. D., Patterson, E. S., & Roth, E. M. (2002). Can we ever escape from data overload? A cognitive systems diagnosis. Cognition, Technology and Work, 4, 22–36. doi:10.1007/s101110200002

Xu, L., & Embley, D. W. (2004). Combining the best of global-as-view and local-as-view for data integration. Information Systems Technology and Its Applications, 3rd International Conference ISTA'2004 (pp. 123-136). Retrieved on September 2, 2008, from www.deg.byu.edu/papers/PODS.integration.pdf
