Scientific Data Mining and Knowledge Discovery
Scientific Data Mining and Knowledge Discovery: Principles and Foundations

Editor
Mohamed Medhat Gaber
Caulfield School of Information Technology
Monash University
900 Dandenong Rd.
Caulfield East, VIC 3145
Australia
[email protected]

Color images of this book can be found at www.springer.com/978-3-642-02787-1

ISBN 978-3-642-02787-1
e-ISBN 978-3-642-02788-8
DOI 10.1007/978-3-642-02788-8
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2009931328
ACM Computing Classification (1998): I.5, I.2, G.3, H.3

© Springer-Verlag Berlin Heidelberg 2010

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Cover design: KuenkelLopka GmbH
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

This book is dedicated to:
My parents: Dr. Medhat Gaber and Mrs. Mervat Hassan
My wife: Dr. Nesreen Hassaan
My children: Abdul-Rahman and Mariam

Contents

Introduction
    Mohamed Medhat Gaber

Part I Background

Machine Learning
    Achim Hoffmann and Ashesh Mahidadia
Statistical Inference
    Shahjahan Khan
The Philosophy of Science and its relation to Machine Learning
    Jon Williamson
Concept Formation in Scientific Knowledge Discovery from a Constructivist View
    Wei Peng and John S. Gero
Knowledge Representation and Ontologies
    Stephan Grimm

Part II Computational Science

Spatial Techniques
    Nafaa Jabeur and Nabil Sahli
Computational Chemistry
    Hassan Safouhi and Ahmed Bouferguene
String Mining in Bioinformatics
    Mohamed Abouelhoda and Moustafa Ghanem

Part III Data Mining and Knowledge Discovery

Knowledge Discovery and Reasoning in Geospatial Applications
    Nabil Sahli and Nafaa Jabeur
Data Mining and Discovery of Chemical Knowledge
    Lu Wencong
Data Mining and Discovery of Astronomical Knowledge
    Ghazi Al-Naymat

Part IV Future Trends

On-board Data Mining
    Steve Tanner, Cara Stein, and Sara J. Graves
Data Streams: An Overview and Scientific Applications
    Charu C. Aggarwal

Index

Contributors

Mohamed Abouelhoda, Cairo University, Orman, Gamaa Street, 12613 Al Jizah, Giza, Egypt; Nile University, Cairo-Alex Desert Rd, Cairo 12677, Egypt
Charu C. Aggarwal, IBM T. J. Watson Research Center, NY, USA, AL 35805, USA, [email protected]
Ghazi Al-Naymat, School of Information Technologies, The University of Sydney, Sydney, NSW 2006, Australia, [email protected]
Ahmed Bouferguene, Campus Saint-Jean, University of Alberta, 8406, 91 Street, Edmonton, AB, Canada T6C 4G9
Mohamed Medhat Gaber, Centre for Distributed Systems and Software Engineering, Monash University, 900 Dandenong Rd, Caulfield East, VIC 3145, Australia, [email protected]
John S. Gero, Krasnow Institute for Advanced Study and Volgenau School of Information Technology and Engineering, George Mason University, USA, [email protected]
Moustafa Ghanem, Imperial College, South Kensington Campus, London SW7 2AZ, UK
Sara J. Graves, University of Alabama in Huntsville, AL 35899, USA, [email protected]
Stephan Grimm, FZI Research Center for Information Technologies, University of Karlsruhe, Baden-Württemberg, Germany, [email protected]
Achim Hoffmann, University of New South Wales, Sydney 2052, NSW, Australia
Nafaa Jabeur, Department of Computer Science, Dhofar University, Salalah, Sultanate of Oman, nafaa [email protected]
Shahjahan Khan, Department of Mathematics and Computing, Australian Centre for Sustainable Catchments, University of Southern Queensland, Toowoomba, QLD, Australia, [email protected]
Ashesh Mahidadia, University of New South Wales, Sydney 2052, NSW, Australia
Wei Peng, Platform Technologies Research Institute, School of Electrical and Computer Engineering, RMIT University, Melbourne VIC 3001, Australia, [email protected]
Cara Stein, University of Alabama in Huntsville, AL 35899, USA, [email protected]
Hassan Safouhi, Campus Saint-Jean, University of Alberta, 8406, 91 Street, Edmonton, AB, Canada T6C 4G9
Nabil Sahli, Department of Computer Science, Dhofar University, Salalah, Sultanate of Oman, nabil [email protected]
Steve Tanner, University of Alabama in Huntsville, AL 35899, USA, [email protected]
Lu Wencong, Shanghai University, 99 Shangda Road, BaoShan
District, Shanghai, People's Republic of China, [email protected]
Jon Williamson, King's College London, Strand, London WC2R 2LS, England, UK, [email protected]

Introduction

Mohamed Medhat Gaber

M.M. Gaber, Centre for Distributed Systems and Software Engineering, Monash University, 900 Dandenong Rd, Caulfield East, VIC 3145, Australia, e-mail: [email protected]
M.M. Gaber (ed.), Scientific Data Mining and Knowledge Discovery: Principles and Foundations, DOI 10.1007/978-3-642-02788-8_1, © Springer-Verlag Berlin Heidelberg 2010

“It is not my aim to surprise or shock you – but the simplest way I can summarise is to say that there are now in the world machines that think, that learn and that create. Moreover, their ability to do these things is going to increase rapidly until – in a visible future – the range of problems they can handle will be coextensive with the range to which the human mind has been applied”
Herbert A. Simon (1916-2001)

1 Overview

This book suits both graduate students and researchers with a focus on discovering knowledge from scientific data. The use of computational power for data analysis and knowledge discovery in scientific disciplines has its roots in the revolution in high-performance computing systems. Computational science in physics, chemistry, and biology represents the first step towards the automation of data analysis tasks. The rationale behind the development of computational science in different areas was to automate the mathematical operations performed in those areas. No attention was paid to the scientific discovery process itself. Automated Scientific Discovery (ASD) [1–3] represents the second natural step. ASD attempted to automate the process of theory discovery, supported by studies in the philosophy of science and the cognitive sciences. Although early research articles reported great successes, the area has not evolved, for many reasons. The most important was the lack of interaction between scientists and the automating systems.

With the evolution of data storage, large databases have stimulated researchers from many areas, especially machine learning and statistics, to adopt and develop new techniques for data analysis. This has led to the new area of data mining and knowledge discovery. Data mining in scientific applications has been studied in many areas. The focus of data mining in this area has been to analyze data in order to help understand the nature of scientific datasets. Automation of the whole scientific discovery process has not been the focus of data mining research.

Statistical, computational, and machine learning tools have been used in the area of scientific data analysis. With advances in ontologies and knowledge representation, ASD has great prospects. In this book, we provide the reader with a complete view of the different tools used in the analysis of data for scientific discovery. The book serves as a starting point for students and researchers interested in this area. We hope that the book represents an important step towards the evolution of scientific data mining and automated scientific discovery.

2 Book Organization

The book is organized into four parts. Part I provides the reader with background on the disciplines that have contributed to scientific discovery. Hoffmann and Mahidadia provide a detailed introduction to the area of machine learning in Chapter Machine Learning. Chapter Statistical Inference by Khan gives the reader a clear introductory overview of the field of statistical inference. The relationship between scientific discovery and the philosophy of science is discussed by Williamson in Chapter