APPENDIX-I Data Mining Research Projects
Abstract. We all live in the information age. Businesses, companies, and agencies collect large amounts of information, and recent advances in technologies that automate and improve data collection have only increased the volume of data. The purpose of collecting data is to extract useful information. Data mining is primarily the process of knowledge discovery in databases: the information of interest is the previously unknown and potentially useful information contained in the database. Data mining has therefore become a research area of interest for meeting this need effectively. We present data mining research projects collected from various universities and laboratories.

A.1 National University of Singapore: Data Mining Research Projects

A.1.1 Cleaning Data for Warehousing and Mining

Data from real-world sources are often erroneous, incomplete, and inconsistent, which can result from operator error, system implementation flaws, etc. Such low-quality data are not suitable for effective data mining. Data cleaning has been identified as an important problem; however, little progress has been made thus far. In this project, we study the issues related to data cleaning with the aim of developing an engineering approach that can be useful to the user. The project consists of three phases:

• Identify and categorize the possible errors in data from multiple sources;
• Survey the available and potentially usable techniques to address the problem; and
• Develop a system that can identify and resolve some of the errors.

(Contact person: Dr. Lu Hongjun.)

A.1.2 Data Mining in Multiple Databases

An organization commonly keeps many databases, collected to serve different purposes. Data mining in individual databases has attracted a lot of attention, and some encouraging results have been achieved.
It is now time to consider how we can make use of all the databases in an organization for data mining. Many issues remain unresolved. Significant ones are:

• Will one database help in the data mining of another database?
• How can we consider multiple databases simultaneously for data mining?
• Can current data mining techniques help in this new situation, or should we develop new techniques?
• What is special about mining multiple databases?

(Contact person: Dr. Liu Huan.)

A.1.3 Intelligent Web Document Management Using Data Mining Techniques

With the development of Internet/Web technology, the volume of Web documents is increasing dramatically, and effective management of these documents is becoming an important issue. The objective of this project is to build an intelligent system that can help Web masters manage Web documents so that they can serve users better. We first survey the available and potentially useful techniques for discovering access patterns of Web documents stored in an information provider’s Web server. The major issues include establishing measurements and heuristics on user access patterns and developing techniques to discover and maintain such patterns. The results are then extended toward using the discovered user access patterns to manage Web documents so that information subscribers can access information of interest more efficiently. Techniques to be investigated include clustering of Web documents, prefetching and caching, and customized linkage of Web documents. (Contact person: Dr. Lu Hongjun.)

A.1.4 Data Mining with Neural Networks

While the use of neural networks for pattern classification has been common in practice, neural networks have not been widely applied in data mining applications. The reason is that the decision process of a neural network is not easily explainable in terms of rules that human experts can verify.
In the past two years, we have investigated the problem of extracting rules from trained neural networks. Our results have been encouraging. The rules extracted by our algorithms are not only more concise than those generated from decision trees but are, in general, also more accurate. (Contact person: Dr. Rudy Setiono.)

A.1.5 Data Mining in Semistructured Data

As the amount of data available on-line grows rapidly, more and more data is semistructured and hierarchical, i.e., the data has no absolute schema fixed in advance, and its structure may be irregular or incomplete. Semistructured data arises when the source does not impose a rigid structure and when the data is obtained by combining several heterogeneous data sources. An example of a semistructured, hierarchical data source is the Web. Other examples of semistructured data include the results of integrating heterogeneous data sources, BibTeX files, genome databases, drug and chemical structures, libraries of programs, and, more generally, digital libraries and on-line documentation. The goal of this project is to develop a general framework for mining associative patterns from semistructured and hierarchical data. (Contact person: Dr. Wang Ke.)

A.1.6 A Data Mining Application – Customer Retention in the Port of Singapore Authority (PSA)

“Consumer retention” is an increasingly pressing issue in today’s ever-competitive commercial arena. This is especially relevant and important for all sales- and service-related industries. Motivated by this real-world problem in the context of the Port of Singapore Authority (PSA), our work proposed a solution that integrates various data mining techniques, such as decision-tree induction, deviation analysis, and mining of multiple concept-level association rules, to form an intuitive and novel approach to gauging customers’ loyalty and predicting their likelihood of defection.
Immediate actions triggered by these “early warnings” are often the key to the eventual retention of the customers involved. (Contact persons: Dr. Liu Huan, Md. Farhad Hussain, Ng Kian Sing.)

A.1.7 A Belief-Based Approach to Data Mining

The work focused on improving the existing “association rule” discovery technique for practical applications. This was motivated by a collaborative effort with the Ministry of Education (MOE) to discover “interesting” knowledge in the Gifted Education Program (GEP). Although the “Apriori” algorithm can extract a large number of associative patterns, its sole reliance on the objective criteria of “support” and “confidence” for the generation of strong rules deprives the user of flexibility and control over what kinds of rules are “interesting” and specifically required. To overcome these deficiencies, we introduced the subjective criterion of “payoff” and proposed a belief-based data mining framework to incorporate the user’s preferences (transient requirements) and beliefs (firm knowledge) into the process of association rule mining. With this, we are also able to find interesting “reliable exceptions,” as opposed to only patterns having high predictive accuracy. This approach is further applied to solving the “interestingness” problem for other data mining tasks, such as “classification” and “correlation” rule mining. (Contact persons: Dr. Liu Huan, Md. Farhad Hussain, Ng Kian Sing.)

A.1.8 Discovering Interesting Knowledge in Database

Knowledge discovery is the extraction of implicit, previously unknown, and potentially useful information from data. There is a growing realization and expectation that data, intelligently analyzed and presented, will be a valuable resource to be used for competitive advantage. Over the past few years, many techniques have been developed for discovering hidden knowledge (which is normally expressed as patterns) in databases.
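
For readers unfamiliar with the objective criteria named in Section A.1.7, the following minimal Python sketch shows how the support and confidence of an association rule are conventionally computed. The transactions, item names, and function names are hypothetical illustrations and are not taken from the projects described here.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[Set[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Support of (antecedent and consequent) divided by support of antecedent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    support_a = support(transactions, a)
    if support_a == 0.0:
        return 0.0  # rule never fires, so its confidence is undefined; report 0
    return support(transactions, both) / support_a


# Hypothetical market-basket data, used only for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

A rule is usually called "strong" when both values exceed user-chosen thresholds. Because these thresholds are purely objective, a large rule set can still contain many rules of no interest to the user.
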
While these techniques have been shown to be successful in numerous industrial applications, new problems have also emerged. One problem is that it is all too easy to discover a huge number of patterns in a database, most of which are actually useless or uninteresting to the user. Because of the huge number, it is difficult for the user to comprehend the patterns and to identify those that are useful to him/her. To prevent the user from being overwhelmed by the large number of patterns, techniques are needed to identify the interesting ones and/or to rank all the patterns according to their interestingness. In this project, we study this problem and design techniques to perform the identification and/or ranking tasks. (Contact person: Dr. Liu Bing.)

A.1.9 Data Mining for Market Research

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. It is now widely recognized that data, intelligently analyzed and presented, will be a valuable resource to be used for competitive advantage. In marketing research, a marketing manager typically analyzes customer data in order to target the right customers for the company’s products, to retain profitable customers, and to attract new customers. There is a growing field of research called database marketing that uses data mining techniques to help marketing managers analyze data, segment the market, position their products, and study customers’ buying behaviors and patterns. (Contact person: Dr. Liu Bing.)

A.1.10 Data Mining in Electronic Commerce

Electronic commerce is the cutting edge of business today. It allows consumers and companies to buy and sell over computer networks. Clearly, with electronic commerce, a huge amount of information can be generated and collected, e.g., information about how customers search for products, what they buy, how they make purchase decisions, etc.
Such information can be analyzed and mined to discover interesting patterns. These patterns can then be used to better serve customers and to help marketing and advertising activities. (Contact person: Dr. Liu Bing.)

A.1.11 Multidimensional Data Visualization Tool

The amount of data stored in databases is growing very rapidly. It has been estimated that the amount of information in the world doubles every 20 months.