Building a Data Warehousing and Data Mining from Course

Information Technology for People-Centred Development (ITePED 2011)

BUILDING DATA WAREHOUSING AND DATA MINING FROM COURSE MANAGEMENT SYSTEMS: A Case Study of Federal University of Technology (FUTA) Course Management Information Systems

*Akintola K.G., ** Adetunmbi A.O. **Adeola O.S. *Department of Computer Science, University of Houston-Victoria, Texas 77901 United States of America. **Department of Computer Science, Federal University of Technology, Akure, Ondo State, Nigeria

ABSTRACT In recent years, decision support systems otherwise called business Intelligence (BI) have become an integral part of organization's decision making strategy. Organizations nowadays are competing in the global market. In order for a company to gain competitive advantage over the others and also to help make better decisions, Data warehousing cum Data Mining are now playing a significant role in strategic decision making. It helps companies make better decisions, streamline work-flows, and provide better customer services. This paper gives the report about developing data warehouse for business management using the FUTA Student-Course management system as a case study. It describes the process of data warehouse design and development using Microsoft SQL Server Analysis Services. It also outlines the development of a data cube as well as application of Online Analytical processing (OLAP) tools and Data Mining tools in data analysis. It was concluded that the effective use Data-Warehousing and Datamining in Enterprise resource management system will promote the rapid growth of major companies.

Keywords: OLTP, Data warehouse, Data Mining, Dimensional modeling, OLAP

1.0 INTRODUCTION lack of historical data in OLTP makes it In simple terms, a data warehouse unsuitable to provide a comprehensive (DW) is a pool of data produced to support information about the operations of the decision making; It is also a repository of business. DW on the other hand, provides a current and historical data of potential central repository of historical data which interest to managers throughout the provides an integrated platform for historical organization. Data are usually structured to analysis of data. With a data warehouse and be available in a form ready for analytical Online Analytical Processing (OLAP), users processing activities (e.g. online analytical can perform better data analysis and gain processing (OLAP), data mining, querying, better knowledge from the repository data. reporting and other decision supporting According to Stephen Brobst and Joe applications). A data warehouse is a subject- Rarey (2003) Five stages of decision support oriented, integrated, time-variant, non- were identified in data-warehouse: volatile collec-tion of data in support of management's decision-making process. Stage 1: Reporting [Efraim T. et al., 2010]. The day-to-day The initial stage of data warehouse operations of an organization are done by deployment typically focuses on reporting using the OLTP system. This system is good from a single source of truth within an for normal operations and few decision organization. The biggest challenge in Stage making but the system is inadequate when it 1 data warehouse deployment is data inte- comes to strategic decision support. The gration.

Stage 2: Analyzing confined to strategic decision-making In a Stage 2 data warehouse processes. However, tactical decision deployment, decision-makers focus less on support does not replace strategic decision what happened and more on why it support. Rather, an active data warehouse happened. Analysis activities are concerned supports the coexistence of both types of with drilling down beneath the numbers on a workloads. report to slice and dice data at a detailed level. Performance is also a lot more The future of data warehousing important in a Stage 2 data warehouse We are moving to the stage of a data implementation because the information ware housing applications that can provide repository is used much more interactively. information to many decision makers operational, strategic, and tactical and also Stage 3: Predicting to the customers as well in an integrated As an organization becomes well- fashion. entrenched in quantitative decision-making techniques and experiences the value Data Mining proposition for understanding the “whats” Data mining is primarily used today and “whys” of its business dynamics, the by companies with a strong consumer focus next step is to leverage information for - retail, financial, communication, and predictive purposes. marketing organizations. It enables these Advanced data mining methods often companies to determine relationships among employ complex mathematical functions "internal" factors such as price, product such as logarithms, exponentiation, trigono- positioning, or staff skills, and "external" metric functions and sophisticated statistical factors such as economic indicators, functions to obtain the predictive character- competition, and customer demographics. istics desired. And, it enables them to determinethe impact on sales, customer satisfaction, and Stage 4: Ope-rationalizing corporate profits. Finally, it enables them to Ope-rationalization in Stage 4 of the "drill down" into summary information to evolution starts to bring us into the realm of view detail transactional data (Bill Palace active data warehousing. Whereas stages 1 1996 ) to 3 focus on strategic decision-making With data mining, a retailer could within an organization, Stage 4 focuses on use point-of-sale records of customer tactical decision support. Ope-rationalizing purchases to send targeted promotions based typically means providing access to on an individual's purchase history. By information for immediate decision-making mining demographic data from comment or in the field. Two examples are (1) inventory warranty cards, the retailer could develop management with just-in-time replenishment products and promotions to appeal to and (2) scheduling and routing for package specific customer segments. delivery. Many retailers are moving toward For example, Blockbuster Entertain- vendor managed inventory, with a retail ment mines its video rental history database chain and the manufacturers that supply it to recommend rentals to individual working as partners. The goal is to reduce customers. American Express can suggest inventory costs through more efficient products to its cardholders based on analysis supply chain management. of their monthly expenditures. (Bill Palace 1996). Stage 5: Active Warehousing WalMart is pioneering massive data An active data warehouse delivers mining to transform its supplier relation- information and enables decision support ships. WalMart captures point-of-sale throughout an organization rather than being transactions from over 2,900 stores in 6

countries and continuously transmits this data to its massive 7.5 terabyte Teradata data Data mining consists of five major warehouse. WalMart allows more than elements: 3,500 suppliers, to access data on their . Extract, transform, and load transaction products and perform data analyses. These data onto the data warehouse system. suppliers use this data to identify customer . Store and manage the data in a buying patterns at the store display level. multidimensional database system. They use this information to manage local . Provide data access to business analysts store inventory and identify new and information technology profess- merchandising opportunities. In 1995, sionals. WalMart computers processed over 1 . Analyze the data by application soft- million complex data queries (Bill Palace ware. 1996). . Present the data in a useful format, such as a graph or table. How data mining works While large-scale information techno- Different levels of analysis are available: logy has been evolving separate transaction . Artificial neural networks: Non-linear and analytical systems, data mining provides predictive models that learn through the link between the two. Data mining training and resemble biological neural software analyzes relationships and patterns networks in structure. in stored transaction data based on open- . Genetic algorithms: Optimization tech- ended user queries. Several types of niques that use processes such as genetic analytical software are available: statistical, combination, mutation, and natural machine learning, and neural networks. selection in a design based on the Generally, any of four types of relationships concepts of natural evolution. are sought: . Decision trees: Tree-shaped structures . Classes: Stored data is used to locate that represent sets of decisions. These data in predetermined groups. For decisions generate rules for the example, a restaurant chain could mine classification of a dataset. Specific customer purchase data to determine decision tree methods include Classifi- when customers visit and what they cation and Regression Trees (CART) typically order. This information could and Chi Square Automatic Interaction be used to increase traffic by having Detection (CHAID) . CART and CHAID daily specials. are decision tree techniques used for . Clusters: Data items are grouped classification of a dataset. They provide according to logical relationships or a set of rules that you can apply to a new consumer preferences. For example, data (unclassified) dataset to predict which can be mined to identify market records will have a given outcome. segments or consumer affinities. CART segments a dataset by creating 2- . Associations: Data can be mined to way splits while CHAID segments using identify associations. The beer-diaper chi square tests to create multi-way example is an example of associative splits. CART typically requires less data mining. preparation than CHAID. . Sequential patterns: Data is mined to . Nearest neighbor method: A technique anticipate behavior patterns and trends. that classifies each record in a dataset For example, an outdoor equipment based on a combination of the classes of retailer could predict the likelihood of a the k record(s) most similar to it in a backpack being purchased based on a historical dataset (where k 1). Some- consumer's purchase of sleeping bags times called the k-nearest neighbour and hiking shoes. technique.

. Rule induction: The extraction of useful which are not enough to make strategic if-then rules from data based on decisions. With a data warehouse and OLAP, statistical significance. staff and management can perform roll-up . Data visualization: The visual inter- and drill-down operations to student pretation of complex relationships in enrollments by year, by semester by course multi-dimensional data. Graphics tools or any combination desired. The system will are used to illustrate data relationships provide decision support that is flexible and (Bill Palace 1996). user-friendly. The University would like to use the data-warehouse to answer questions 2.0 THE CASE PROJECT OF STUD- such as: What is the trend of student ENT COURSE MANAGEMENT admission, student’s enrollments, Lecturers SYSTEM room schedule, number of annual enrollment The case study used as a model for etc. These type of questions need a lot of the project is a student course management historical data to generate which the current system in Federal University of Technology system OLTP system cannot support. Akure Nigeria. The school registers students every semester and students take courses 3.0 DESIGNING THE DATA WARE- and exams. These are being managed by an HOUSE online transaction processing (OLTP) The following discussion outlines the system. A simplified representation of the process of the data warehouse design. It logical design of the OLTP system is shown involves the logical design, the OLAP in Figure 1. design, and Data mining design.

The Motivation for the Data Warehouse The logical Design system The logical design of data-warehouse The day-to-day operations of the is defined by the dimensional data modeling school rely heavily upon the OLTP system. approach. The dimensioning design process The staffs make use of OLTP system for followed in this project adheres to the management information system. The need methodology described by Kimball and for strategic decision necessitates the Ross (2002). development of the data warehouse. The OLTP can only handle few historical data

Figure 1. The logical design of OLTP for class registration database Unlike the Entity Relationship (ER) and modeling processes, the logical design of Unified Modeling Language (UML) data Data Warehouse (DW) is defined by the

dimensional data modeling approach. To warehouse model than a star schema, minimize the join operations which slow and is a type of star schema. It is called a down queries, normalization is not the snowflake schema because the diagram guiding principle in DW design. A schema of the schema resembles a snowflake. is a collection of database objects, including Snowflake schemas normalize dimen- tables, views, indexes, and synonyms. There sions to eliminate redundancy. That is, is a variety of ways of arranging schema the dimension data has been grouped objects in the schema models designed for into multiple tables instead of one large data warehousing. The following are the two table. For example, a product dimension types of schemas commonly used in table in a star schema might be dimensional data modeling. normalized into a products table, a • Star schema: The star schema is product_category table, and a perhaps the simplest data warehouse product_ manufacturer table in a schema. It is called a star schema snowflake schema. While this saves because the entity-relationship diagram space, it increases the number of of this schema resembles a star, with dimension tables and requires more points radiating from a central table. The foreign key joins. The result is more center of the star consists of a large fact complex queries and reduced query table and the points of the star are the performance dimension tables. A star schema is characterized by one or Dimensional Data Modeling approach more very large fact tables that contain The dimensional approach is quite the primary information in the data different from the normalization approach warehouse, and a number of much followed when designing a database for smaller dimension tables (or lookup daily operations. Figure 2 shows the five dimension tables used in the project. tables), each of which contains information about the entries for a Data Hierarchies in Dimensional tables particular attribute in the fact table. Each of the dimensions contains at Star schema facilitates quick least one hierarchy. The hierarchies allow response to queries. The core detailed users to analyze data aggregations using the values are stored in fact table. The OLAP. This allows related items to be dimensional info and hierarchies are grouped and summarized for high level kept in dimension tables analysis while retaining the ability to drill • Snowflake schema: The snowflake down to more specific product detail. schema is a more complex data

DimStudent DimFaculty DimClassroom PK StudentID PK FacultyID PK ClassroomID F i s r t N a m e F i s r t N a m e C l a s s r o o m LastName LastName BuildingName Sex Department Capacity

Major Sex

DimSchedule DimCourse PK ClassID PK CourseID a c a d y e a r C o u r s e N a m e Semester Credit WeekDay Description

TimeBlock

FigureThe 2: dimensional The Dimensional tables keep tables the sho datawing datathat hierarchies are used for analyzing the data

warehouse. For example we can look for ClassFact Table Faculty who taught a particular course. PK StudentID PK CourseID Fact Table PK ClassID Fact table contains dimension PK ClassroomID attributes and measures. Dimension PK FacultyID attributes are FKs or other attributes called Grade degenerate dimension (<

>). Measures Mark are the values to be aggregated when queries Registered group rows together. The Fact table is composed of two types of attributes: Figure 3: The Fact Table dimension attributes and measures. Figure 3 presents the fact table for configuration is termed a star schema. A star this research. In the table, the registered schema has been adopted in this research. field is used as a measure while the other Below is the diagram of the relationship keys are used to link the dimension tables. between the fact table and the Dimensional tables. The Star Schema: In the figure 4, the relationships Data Warehouse are commonly between the fact table and dimension tables organized with one large central fact table, are depicted in a star form. and many smaller dimensions tables. This

Figure 4. The Star schema

4.0 IMPLEMENTATION the creation of integration service project 4.1 Data Warehouse using the Business Intelligence package in Transporting Data from OLTP the SQL server. After creating the project, a Database to Data Warehouse data-source is then created followed by the SQL Server provides the tool SSIS to creation of OLEDB data-source and assist in transporting data in and out of a destination. The following tables show the database. It involves creation of the dimensional tables and fact table created in dimension table’s data structure, followed by the course of this project. a. DimStudent

b. DimFaculty

c. DimSchedule

d. DimCourse

e. DimClassroom

FactTable

Fig. 5 Populated Dimensional tables and fact table

The tables 5 a, b, c, d, e and f show Figure 6 shows how a star schema the typical contents of the fact and dimen- described earlier is transformed into an sion tables after some records have been internal form, which SQL business intelli- entered into the tables. gence can use to fetch appropriate data requested by the users. 4.2 Online Analytical Process (OLAP) After Data Warehouse has been 4.3 The Data Analysis Reports populated with data, the next step in the data The reports blows are the output warehouse development is to provide the results obtained as a result of roll-up and users with data analysis tool such as OLAP drill-down operations of OLAP performed and Data Mining to analyze the business on the data for various hierarchies of the process for decision making. OLAP is a dimensions. service that automatically selects a set of summary views (tables), and saves these DimFac… summary views to disk. OLAP also manage these views and update them when the fact table has new data. To create this, we must DimSch… ClassF… DimStu… first create Analysis Services project using Business Intelligence Development Studio. Then, we define the data source, create the DimStu… DimStu… data source view and then create Cube. Figure below shows a screen shot of the Figure 6: Class Registration OLAP cube created with SQL Server. Cube

Figure 7. Data Analysis based on student gender and courses registered.

Figure 7 shows list of students’ sex and by course. It is interactively generated by OLAP based on users’ choice.

Registration by academic Year

Figure 8 Data Analysis based on Academic Year and course registered by students

This figure shows count of students users choice. taught by lecturer in academic year 2009. It is interactively generated by OLAP based on

Lecture Room by Time-block showing Number of students by Courses

Figure 9 Data Analysis based on Lecture rooms, time tables versus Courses registered

This figure shows Lecture rooms allocation by course and number of students Student ID participating in that course. Such a report can be used for effective lecture theater allocation for course lecturing or Mark examination. It is interactively generated by OLAP based on users’ choice. Faculty ID

4.4 The Data Mining Figure 10: Data Clustering Mining that (Discovering Hidden knowledge finds Relationship between using Data Mining). Data mining is a Students Marks and Faculties process to have the computer search for correlations in the data, and present This figure shows the result obtained promising hypothesis to users for from the data-mining model above. It shows consideration. The problems that can be class group of Mark 90 and group of mark solved by data mining are: 75. The class a student favours between (a) Categorization: Classify a given set mark 90 to 75 and the degree of membership cases of such a class is shown by the bar. Also (b) Clustering: Find the natural shown are faculties and the degree of their groupings of a given set of cases. membership in awarding marks in the range (c) Association rule: Find out what of 90 and that of group of 75 and the degree items are frequently processed of their membership of such a group. together.

A data-mining that clusters students degree of membership of such a class is and Lectures showing the class a student shown by the bar. favors’ between mark 90 to 75 and the

Figure 11: Discriminant mining showing Relationship between Students Marks and Faculties

5.0 CONCLUSION Resources, In this paper, we have been able http://infogoal.com/dmc/dmcdwh to demonstrate the process of designing .htm. and developing data-warehouse and data BIN (2007) Business Intelligence mining applications using SQL server Network, http://www.beye- business intelligence Development tools network.com/home/ using a case study in an academic Bill Palace (1996). Data Mining environment. It is to be noted however Technology Note prepared for that this technique can be applied to any Management 274A Anderson organization wishing to implement Graduate School of Mana- business intelligence as part of their gement at UCLA strategic decision support operations. http://www.ander- The power of Data-Warehousing in data son.ucla.edu/faculty/jason.frand/t analysis is tremendous and data-mining eacher/technologies/palace/index can discover hidden treasures in the .htm. data-warehouse. David Heise: An online publication on Organizations, particularly in Data Warehousing in higher Nigeria can begin to implement this Education, project as part of their strategic decision http://dheise.andrews.edu/dw/D making process tools. With it they can WData.htm begin to see trends of things historically Fang, R. and Tuladhar, S. 2006. as their day to day operational data “Teaching Data Warehousing and accumulates over the years. They can Data Mining in a Graduate forecast the future using neural Program in Informa-tion Networks, regression analysis, and other Technology,” Journal of Com- data mining operations incorporated into puting Sciences in Colleges, Vol. SQL server Business Intelligence. Data- 21, Issue 5, pp. 137-144. warehouse is the solution we need at the Kimball R. and Ross M. (2002). The moment to catapult our business or our Data Warehouse Toolkit, second organization to the next level. edition. John Wiley and Sons, Inc., USA. 6.0 REFERENCES Michael A. King (2009). A Realistic Efraim T., Jay E., Teng-Peng L., Ramesh Data Warehouse Project: An S, (2010). Decision Support and Integration of Microsoft Access Business Intelligence Systems and Microsoft Excel Advanced (8th ed) Prentice Hall. Infogold Features and Skills Journal of Data Warehouse, Data Mart, Data Information Technology Mining, and Decision Support Education Vol. 8, 2009

Innovations in Practice. Pierce E.M. (1999). “Developing and Deli-vering a Data Warehousing and Data Mining Course,” Communications of the AIS, Vol. 2, Article 16, pp. 1-22. Jacobson R. (2000). Microsoft SQL Server (2000). Analysis Services, Step by Step; Microsoft Press, Redmond, Washington. Stephen Brobst, Joe Rarey (2003). Five Stages of Data Warehouse Decision Support Evolution http://dssre- sources.com/papers/features/brob st&rarey01062003.html. Wierschem D., McMillan J. and McBroom R. (2003). “What Academia Can Gain from Building a data Warehouse,” Educause Quarterly, Number 1, pp. 41-46.

NIGERIA COMPUTER SOCIETY (NCS): 10TH INTERNATIONAL CONFERENCE – JULY 25-29, 2011