
The Third International Conference on Knowledge Discovery and Data Mining

August 14–17, 1997, Newport Beach, California

Sponsored by the American Association for Artificial Intelligence

Cosponsored by General Motors Research, The National Institute for Statistical Sciences, and NCR Corporation

In cooperation with The American Statistical Association

Welcome to KDD-97!

KDD-97 Organization

General Conference Chair
Ramasamy Uthurusamy, General Motors Corporation

Program Cochairs
David Heckerman, Microsoft Research
Heikki Mannila, University of Helsinki, Finland
Daryl Pregibon, AT&T Laboratories

Publicity Chair
Paul Stolorz, Jet Propulsion Laboratory

Tutorial Chair
Padhraic Smyth, University of California, Irvine

Demo and Poster Sessions Chair
Tej Anand, NCR Corporation

Awards Chair
Gregory Piatetsky-Shapiro, GTE Laboratories

Program Committee Members
Helena Ahonen, University of Helsinki
Tej Anand, NCR
Ron Brachman, AT&T Laboratories
Carla Brodley, Purdue University
Dan Carr, George Mason University
Peter Cheeseman, NASA AMES Research Center
David Cheung, University of Hong Kong
Wesley Chu, University of California at Los Angeles
Gregory Cooper, University of Pittsburgh
Robert Cowell, City University, UK
Bruce Croft, University of Massachusetts at Amherst
William Eddy, Carnegie Mellon University
Charles Elkan, University of California at San Diego
Usama Fayyad, Microsoft Research
Ronen Feldman, Bar-Ilan University, Israel
Jerry Friedman, Stanford University
Dan Geiger, Technion, Israel
Clark Glymour, Carnegie Mellon University
Moises Goldszmidt, SRI International
Georges Grinstein, University of Massachusetts, Lowell
Jiawei Han, Simon Fraser University, Canada
David Hand, Open University, UK
Trevor Hastie, Stanford University
David Heckerman, Microsoft Corporation
Haym Hirsh, Rutgers University
Jim Hodges, University of Minnesota
Se June Hong, IBM T. J. Watson Research Center
Tomasz Imielinski, Rutgers University
Yannis Ioannidis, University of Wisconsin
Lawrence D. Jackel, AT&T Laboratories
David Jensen, University of Massachusetts
Michael , Massachusetts Institute of Technology
Daniel Keim, University of Munich
Willi Kloesgen, GMD, Germany
Ronny Kohavi, Silicon Graphics
David D. Lewis, AT&T Laboratories – Research
David Madigan, University of Washington
Heikki Mannila, University of Helsinki
Brij Masand, GTE Laboratories
Gary McDonald, General Motors Research
Eric Mjolsness, University of California at San Diego
Sally Morton, Rand Corporation
Richard Muntz, University of California at Los Angeles
Raymond Ng, University of British Columbia
Steve Omohundro, NEC Research
Gregory Piatetsky-Shapiro, GTE Laboratories
Daryl Pregibon, Bell Laboratories
Raghu Ramakrishnan, University of Wisconsin, Madison
Patricia Riddle, Boeing Computer Services
Ted Senator, NASD Regulation Inc.
Jude Shavlik, University of Wisconsin at Madison
Wei-Min Shen, University of Southern California
Arno Siebes, CWI, Netherlands
Avi Silberschatz, Bell Laboratories
Evangelos Simoudis, IBM Almaden Research Center
Andrzej Skowron, University of Warsaw
Padhraic Smyth, University of California at Irvine
Ramakrishnan Srikant, IBM Almaden Research Center
John Stasko, Georgia Institute of Technology
Salvatore J. Stolfo, Columbia University
Paul Stolorz, Jet Propulsion Laboratory
Hanna Toivonen, University of Helsinki
Alexander Tuzhilin, New York University, Stern School
Ramasamy Uthurusamy, General Motors R&D Center
Graham Wills, Bell Laboratories
David Wolpert, IBM Almaden Research Center
Jan Zytkow, Wichita State University

Thursday, August 14th

8:00 AM–6:00 PM
Tutorials

8:00–10:00 AM (SINGLE SESSION)
Tutorial 1
Data Mining and KDD: An Overview
Usama Fayyad, Microsoft Research and Evangelos Simoudis, IBM

10:00–10:30 AM
Coffee Break

10:30 AM–12:30 PM
Tutorial 2
Modeling Data and Discovering Knowledge
David Hand, Open University, UK

10:30 AM–12:30 PM
Tutorial 3
Text Mining - Theory and Practice
Ronen Feldman, Bar-Ilan University, Israel

12:30–1:30 PM
Lunch

1:30–3:30 PM
Tutorial 4
Exploratory Data Analysis Using Interactive Dynamic Graphics
Deborah Swayne, Bell Communications Research and Diane Cook, Iowa State University

1:30–3:30 PM
Tutorial 5
OLAP and Data Warehousing
Surajit Chaudhuri, Microsoft Research and Umesh Dayal, Hewlett Packard Laboratories

3:30–4:00 PM
Coffee Break

4:00–6:00 PM
Tutorial 6
Visual Techniques for Exploring Databases
Daniel Keim, University of Munich, Germany

4:00–6:00 PM
Tutorial 7
Statistical Models for Categorical Response Data
William DuMouchel, AT&T Research

6:30–7:30 PM
Opening Reception
California Ballroom, Newport Beach Marriott Hotel and Tennis Club

KDD-97 Tutorial Program

All tutorials will be presented on Thursday, August 14, 1997 in sections of the Pacific Ballroom. Admission to the tutorials is included in your conference registration fee. Registrants can attend up to four consecutive tutorials, including four tutorial syllabi.

TIME                  SESSION 1                                   SESSION 2
8:00 to 10:00 AM      T1 – Fayyad and Simoudis (single session)
10:30 AM to 12:30 PM  T2 – Hand                                   T3 – Feldman
1:30 to 3:30 PM       T4 – Swayne and Cook                        T5 – Chaudhuri and Dayal
4:00 to 6:00 PM       T6 – Keim                                   T7 – DuMouchel

Friday, August 15th

All sessions are plenary sessions and will be held in the Pacific Ballroom, Newport Beach Marriott Hotel and Tennis Club.

8:30–10:00 AM
Keynote Address and Invited Talk Session

8:30–8:45 AM
Welcome and Introduction
KDD-97 General Conference Chair: Ramasamy Uthurusamy and KDD-97 Program Cochairs: David Heckerman, Heikki Mannila, and Daryl Pregibon

8:45–9:30 AM
Keynote Address: From Large to Huge: A Statistician’s Reactions to KDD & DM
Peter J. Huber, University of Bayreuth, Germany

The statistics and AI communities are confronted by the same challenge, the onslaught of ever larger data collections, but the two communities have reacted independently and differently. What could they learn from each other if they looked over the fence? What is amiss on either side?

9:30–10:00 AM
Invited Talk: Machine Learning and KDD: New Developments
Thomas Dietterich, University of Oregon

Machine learning research is pursuing many directions relevant to KDD. This talk will review two exciting lines of research: learning with ensembles (committees, bagging, boosting, etc.) and learning with stochastic models. I will also briefly mention current research in reinforcement learning and methods for scaling up machine learning algorithms.

10:00–10:30 AM
Coffee Break

10:30–12:15 PM
Keynote Address and Poster Previews Session

10:30–11:15 AM
Keynote Address: Searching for Causes and Predicting Interventions
Clark Glymour, University of California San Diego and Carnegie Mellon University

The problem in many data mining tasks is to predict features of a new, unobserved sample from a recorded sample, assuming the sampling distribution does not change. In causal data mining the task is to predict the features of new unobserved samples from a recorded sample, assuming any unobserved sample results from an intervention which directly alters the underlying probability distribution of some variables in a known way, and alters others indirectly through the influences of the directly manipulated variables. Data mining in the sciences and for business, military or public policy is often concerned with discovering causal structure, although that focus is sometimes obscured by the belief that causation is especially mysterious. Causation is not probability, but it is no more (and no less) mysterious. This lecture will illustrate why data mining for causes using techniques designed for recognition or classification is unwise, and will illustrate principles of causal data mining with several interesting cases that use both Bayesian and constraint based methods.

11:15 AM–12:15 PM
Poster Previews Session (Authors A-J)

12:15–2:00 PM
Lunch
(Program Committee Luncheon, Newport North Room, lobby level, Newport Beach Marriott Hotel and Tennis Club)

2:00–3:00 PM
Poster Previews Session (Authors K-Z)

3:00–7:00 PM
Exhibits, Demos, Poster Sessions
California Ballroom, Newport Beach Marriott Hotel and Tennis Club

3:00–4:20 PM
Poster Session 1
Authors A-G

4:20–5:40 PM
Poster Session 2
Authors H-O

5:40–7:00 PM
Poster Session 3
Authors P-Z

(Coffee and soft drinks will be available in the foyer from 3:00–7:00 PM.)

Saturday, August 16th

8:30–9:45 AM
Paper Session
Database Methodology

8:30–8:55 AM
Computing Optimized Rectilinear Regions for Association Rules
Kunikazu Yoda, Takeshi Fukuda, Yasuhiko Morimoto, Shinichi Morishita, and Takeshi Tokuyama, IBM Tokyo Research Laboratory, Japan

8:55–9:20 AM
Density-Connected Sets and their Application for Trend Detection in Spatial Databases
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, University of Munich, Germany

9:20–9:45 AM
Mining Association Rules with Item Constraints
Ramakrishnan Srikant, Quoc Vu, and Rakesh Agrawal, IBM Almaden Research Center

9:45–10:15 AM
Coffee Break

10:15 AM–12:30 PM
Keynote Address/Invited Talks/Panel Session

10:15–11:00 AM
Keynote Address: Are Spatial Data Special: A Data Mining Perspective?
Raymond Ng, University of British Columbia, Canada

From a data mining standpoint, Ng will discuss how spatial data are similar to and different from conventional alphanumeric data. He will use characteristic and discriminant rules as examples, and discuss how the characteristics of spatial data affect the ways these rules can be found. Finally, Ng will comment on the generalization from spatial data to discuss whether temporal data or image data are special.

11:00–11:30 AM
Invited Talk: Highlights from the Joint Statistical Meetings
Jim Hodges, University of Minnesota

Statisticians and data-miners have a lot in common but their paths tend not to cross. Thus, for example, many statisticians are missing out on computing and other ideas familiar to data-miners, and data-miners sometimes reinvent things that are well-known to statisticians. This talk is intended to toss a small bridge across the gap, by summarizing some talks and trends from the 1997 Joint Statistical Meetings, held in Anaheim, California just before KDD-97.

11:30–12:30 PM
Panel Discussion: Validity of Data Mining Results
Moderator: David Hand, The Open University, United Kingdom
Panelists: Jim Hodges, University of Minnesota; Reza Nakhaeizadeh, Daimler-Benz AG, Germany; Gregory Piatetsky-Shapiro, GTE Laboratories; Arno Siebes, CWI, The Netherlands; and Padhraic Smyth, University of California, Irvine

Data mining methods typically rely on large-scale search for patterns or models. How can statistically valid results be ensured in this search-intensive approach? What kind of statistical tests are appropriate? How can one reduce overfitting? What techniques exist for efficient and data-adaptive sampling strategies for specific classes of algorithms? How can one combine statistical and non-statistical measures of the utility of discovered patterns? These and other questions from the audience will be discussed by the panelists.

12:30–2:00 PM
Lunch

2:00–3:30 PM
Paper Session
Applications

2:00–2:30 PM
Detecting Atmospheric Regimes Using Cross-Validated Clustering
Padhraic Smyth, University of California, Irvine; Michael Ghil and Kayo Ide, University of California, Los Angeles; Joe Roden, Jet Propulsion Laboratory, California Institute of Technology; Andrew Fraser, Portland State University

2:30–3:00 PM
Automated Discovery of Active Motifs in Three Dimensional Molecules
Xiong Wang and Jason T.L. Wang, New Jersey Institute of Technology; Dennis Shasha, New York University; Bruce Shapiro, National Cancer Institute; Sitaram Dikshitulu, New Jersey Institute of Technology; Isidore Rigoutsos, IBM T. J. Watson Research Center; Kaizhong Zhang, University of Western Ontario, Canada

3:00–3:30 PM
JAM: Java Agents for Meta-Learning over Distributed Databases
Salvatore Stolfo, Andreas L. Prodromidis, Shelley Tselepis, Wenke Lee, and Dave W. Fan, Columbia University; Philip K. Chan, Florida Institute of Technology

3:30–4:00 PM
Coffee Break

4:00–6:00 PM
Invited Talk/Paper Presentation/Awards Presentation

4:00–4:30 PM
Invited Talk: Highlights from the Graphical Modeling Workshop
David Madigan, University of Washington

This talk will provide a very brief tutorial introduction to basic concepts of graphical models. A survey of current research topics will follow, highlighting in particular the work presented at the June 1997 AMS-IMS-SIAM workshop on graphical models.

4:30–5:00 PM
Knowledge = Concepts: A Harmful Equation
Jan M. Zytkow, Wichita State University and Polish Academy of Sciences, Poland

5:00–6:00 PM
Awards Presentation

Sunday, August 17th

8:30–9:45 AM
Paper Session
Methodology I

8:30–8:55 AM
A Probabilistic Approach to Fast Pattern Matching in Time Series Databases
Eamonn Keogh and Padhraic Smyth, University of California, Irvine

8:55–9:20 AM
Discriminative Versus Informative Learning
Dan Rubinstein and Trevor Hastie, Stanford University

9:20–9:45 AM
Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions
Foster Provost and Tom Fawcett, NYNEX Science and Technology

9:45–10:15 AM
Coffee Break

10:15 AM–12:30 PM
Paper Session
Methodology II/Invited Panel

10:15–10:40 AM
Development of Multi-Criteria Metrics for Evaluation of Data Mining Algorithms
Gholamreza Nakhaeizadeh, Daimler-Benz AG, Germany and Alexander Schnabl, Technical University Vienna, Austria

10:40–11:05 AM
A Visual Interactive Framework for Attribute Discretization
Ramesh Subramonian, Ramana Venkata, and Joyce Chen, Intel Corporation

11:05–11:30 AM
Using General Impressions to Analyze Discovered Classification Rules
Bing Liu, Wynne Hsu, and Shu Chen, National University of Singapore, Singapore

11:30 AM–12:30 PM
Panel Discussion: Data Mining and the Web
Moderator: Usama Fayyad, Microsoft Research
Panelists: Sue Dumais, Bellcore; Ronen Feldman, Bar-Ilan University, Israel; Marti Y. Hearst, Xerox PARC; Haym Hirsh, Rutgers University; Willi Kloesgen, GMD German National Research Center; Kamran Parsaye, Information Discovery Inc.; and Michael Pazzani, University of California, Irvine

The Web is becoming the premium source of information for a growing number of people, and Web-based systems are increasingly applied for various applications, whether for commercial transactions or for a group organizing and performing its collaboration. Two challenges are predominant for data mining on the Web. The first goal is to help users in finding useful information on the Web and in discovering knowledge about a domain that is represented by a collection of Web-documents. The second goal is to analyze the transactions run in a Web-based system, be it to optimize the system or to find information about the clients using the system. This panel will discuss the approaches, chances, and problems of applying data mining techniques for the Web.

12:30–2:00 PM
Lunch

2:00–3:15 PM
Paper Session
Tools/Environments for DM

2:00–2:25 PM
An Interactive Visualization Environment for Data Exploration
Mark Derthick, John Kolojejchick, and Steven F. Roth, Carnegie Mellon University

2:25–2:50 PM
Visualization Techniques to Explore Data Mining Results for Document Collections
Ronen Feldman, Bar-Ilan University, Israel; Willi Klösgen, German National Research Center for Information Technology, Germany; Amir Zilberstein, Bar-Ilan University, Israel

2:50–3:15 PM
Anytime Exploratory Data Analysis for Massive Data Sets
Padhraic Smyth, University of California, Irvine and David Wolpert, IBM Almaden Research Center

3:15–4:00 PM
Open Forum: Comments on KDD-97

8:30 AM–5:00 PM
Newport North Room, Lobby Level
Workshop on Issues in the Integration of Data Mining and Data Visualization (by invitation only)
Organizers: Georges Grinstein, Andreas Wierse, and Usama Fayyad

Poster Sessions

3:00–4:20 PM
Poster Session I
Authors A-G

Discovery of Actionable Patterns in Databases: The Action Hierarchy Approach
Gediminas Adomavicius, New York University and Alexander Tuzhilin, Stern School of Business, New York University

Partial Classification Using Association Rules
Kamal Ali and Stefanos Manganaris, IBM Global Business Intelligence Solutions; Ramakrishnan Srikant, IBM Almaden Research Center

Increasing the Efficiency of Data Mining Algorithms with Breadth-First Marker Propagation
John M. Aronis, University of Pittsburgh and Foster J. Provost, NYNEX Science and Technology

Brute-Force Mining of High-Confidence Classification Rules
Roberto J. Bayardo, Jr., The University of Texas at Austin

Applying Data Mining and Machine Learning Techniques to Submarine Intelligence Analysis
Ulla Bergsten, Johan Schubert, and Per Svensson, Defence Research Establishment, Sweden

Process-Based Database Support for the Early Indicator Method
Christoph Breitner and Jörg Schlösser, University of Karlsruhe, Germany; Rüdiger Wirth, Daimler-Benz AG, Germany

MineSet: An Integrated System for Data Mining
Cliff Brunk, James Kelly, and Ron Kohavi, Silicon Graphics, Inc.

Proposal and Empirical Comparison of a Parallelizable Distance-Based Discretization Method
Jesús Cerquides and Ramon López de Màntaras, Spanish Council for Scientific Research, Spain

Large Scale Data Mining: Challenges and Responses
Jaturon Chattratichat, John Darlington, Moustafa Ghanem, Yike Guo, Harald Hüning, Martin Köhler, Janjao Sutiwaraphun, Hing Wing To, and Dan Yang, Imperial College of London, United Kingdom

Using Artificial Intelligence Planning to Automate Science Data Analysis for Large Image Databases
Steve Chien, Forest Fisher, and Helen Mortensen, Jet Propulsion Laboratory, California Institute of Technology; Edisanter Lo and Ronald Greeley, Arizona State University

Mining Multivariate Time-Series Sensor Data to Discover Behavior Envelopes
Dennis DeCoste, Jet Propulsion Laboratory, California Institute of Technology

Why Does Bagging Work? A Bayesian Account and its Implications
Pedro Domingos, University of California, Irvine

Fast Committee Machines for Regression and Classification
Harris Drucker, Monmouth University and ATT

A Guided Tour through the Data Mining Jungle
Robert Engels, University of Karlsruhe, Germany; Guido Lindner, Daimler Benz AG, Germany; and Rudi Studer, University of Karlsruhe, Germany

Maximal Association Rules: A New Tool for Mining for Keyword Co-occurrences in Document Collections
Ronen Feldman, Yonatan Aumann, Amihood Amir, and Amir Zilberstein, Bar-Ilan University, Israel; Willi Kloesgen, German National Research Center for Information Technology, Germany

Improving Scalability in a Scientific Discovery System by Exploiting Parallelism
Gehad Galal, Diane J. Cook, and Lawrence B. Holder, University of Texas at Arlington

4:20–5:40 PM
Poster Session II
Authors H-O

Deep Knowledge Discovery from Natural Language Texts
Udo Hahn and Klemens Schnattinger, Freiburg University, Germany

Integrating and Mining Distributed Customer Databases
Ira J. Haimowitz, Özden Gür-Ali, and Henry Schwarz, General Electric Corporate Research and Development

GA-Based Rule Enhancement in Concept Learning
Jukka Hekanaho, Turku Center for Computer Science and Åbo Akademi University, Finland

Target-Independent Mining for Scientific Data: Capturing Transients and Trends for Phenomena Mining
Thomas H. Hinke, John Rushing, Heggere Ranganath, and Sara J. Graves, University of Alabama in Huntsville

Zeta: A Global Method for Discretization of Continuous Variables
K. M. Ho and P. D. Scott, University of Essex, United Kingdom

Adjusting for Multiple Comparisons in Decision Tree Pruning
David Jensen and Matt Schmill, University of Massachusetts, Amherst

SIPping from the Data Firehose
George H. John, IBM Almaden Research Center and Brian Lent, Stanford University

Mining Generalized Term Associations: Count Propagation Algorithm
Jonghyun Kahng, Wen-Hsiang Kevin Liao, and Dennis McLeod, University of Southern California

Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes
Micheline Kamber, Jiawei Han, and Jenny Y. Chiang, Simon Fraser University, Canada

Scalable, Distributed Data Mining–An Agent Architecture
Hillol Kargupta, Ilker Hamzaoglu, and Brian Stafford, Los Alamos National Laboratory

Clustering Sequences of Complex Objects
A. Ketterlin, LSIIT, France

A Unified Notion of Outliers: Properties and Computation
Edwin M. Knorr and Raymond T. Ng, University of British Columbia, Canada

Mining for Causes of Cancer: Machine Learning Experiments at Various Levels of Detail
Stefan Kramer, Austrian Research Institute for Artificial Intelligence, Austria; Bernhard Pfahringer, University of Waikato, New Zealand; and Christoph Helma, University of Vienna, Austria

Trellis Graphics Displays: A Multi-Dimensional Data Visualization Tool for Data Mining
R. Douglas Martin, MathSoft, Inc.

Fast Robust Visual Data Mining
Ted Mihalisin, Temple University and John Timlin, Mihalisin Associates, Inc.

5:40–7:00 PM
Poster Session III
Authors P-Z

Beyond Concise and Colorful: Learning Intelligible Rules
Michael J. Pazzani, Subramani Mani, and W. Rodman Shankle, The University of California, Irvine

Scaling Up Inductive Algorithms: An Overview
Foster Provost, NYNEX Science and Technology and Venkateswarlu Kolluri, University of Pittsburgh

Visualizing Bagged Decision Trees
J. Sunil Rao, Cleveland Clinic and William J.E. Potts, SAS Institute Inc.

KESO: Minimizing Database Interaction
Arno Siebes and Martin L. Kersten, CWI, The Netherlands

Learning to Extract Text-Based Information from the World Wide Web
Stephen Soderland, University of Washington

Image Feature Reduction through Spoiling: Its Application to Multiple Matched Filters for Focus of Attention
Timothy M. Stough and Carla E. Brodley, Purdue University

Autonomous Discovery of Reliable Exception Rules
Einoshin Suzuki, Yokohama National University, Japan

An Efficient Algorithm for the Incremental Updation of Association Rules in Large Databases
Shiby Thomas, Sreenath Bodagala, Khaled Alsabti, and Sanjay Ranka, University of Florida

Bayesian Inference for Identifying Solar Active Regions
Michael Turmon and Saleem Mukhtar, Jet Propulsion Laboratory, California Institute of Technology; Judit Pap, University of California, Los Angeles

Schema Discovery for Semistructured Data
Ke Wang and Huiqing Liu, National University of Singapore, Singapore

Selecting Features by Vertical Compactness of Data
Ke Wang and Suman Sundaresh, National University of Singapore, Singapore

Knowledge Discovery in Integrated Call Centers: A Framework for Effective Customer-Driven Marketing
Paul Xia, EIS International Inc.

New Algorithms for Fast Discovery of Association Rules
M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, University of Rochester

Fast and Intuitive Clustering of Web Documents
Oren Zamir, Oren Etzioni, Omid Madani, and Richard M. Karp, University of Washington

KDD Process Planning
Ning Zhong, Yamaguchi University, Japan; Chunnian Liu, Beijing Polytechnic University, China; Yoshitsugu Kakemoto, The University of Tokyo, Japan; Setsuo Ohsuga, Waseda University, Japan

Optimal Multiple Intervals Discretization of Continuous Attributes for Supervised Learning
D. A. Zighed, R. Rakotomalala, and F. Feschet, University of Lyon 2, France

A Dataset Decomposition Approach to Data Mining and Machine Discovery
Blaz Zupan and Marko Bohanec, Institute Jozef Stefan, Slovenia; Ivan Bratko, Institute Jozef Stefan and University of Ljubljana, Slovenia; Bojan Cestnik, Temida and Institute Jozef Stefan, Slovenia

Exhibitors

AcknoSoft

Contact: Michel Manago
58 rue du Dessous des Berges
75013, Paris France
Tel: 331 44 24 88 00
Fax: 331 44 24 88 66
Email: [email protected]
Web: www.acknosoft.com

The KATE suite includes KATE datamining to generate automatically decision trees from cases, KATE-CBR to retrieve cases that are similar to a query and reason by analogy, KATE-editor to build a database that is object-oriented or link to most existing market databases. As an option, users may add KATE for R/3 Service Management to link to SAP and KATE WebServer to access the decision support system over an intranet or the internet. Written in C, KATE is also available as dynamic link libraries in Windows or as shared libraries on Unix (Sun Solaris). The KATE suite is tailor made for analyzing complex data in technical domains. Applications of KATE include maintenance of Boeing 737 engines, of marine diesel engines, of Naples’ metro, of the French high speed train TGV, help desk for robots in the plastic industry, telecom networks, CAD/CAM workstations, reliability analysis of offshore platforms, nuclear power plants, industrial gas meters, the Ariane space center, quality management in manufacturing, evaluation of costs for manufacturing plastic parts, sales support for electronic devices.

IBM Corporation

Contact: Carolina Salcedo
Route 100, Maildrop 3128
Somers, NY 10589
Tel: (914) 766-3929
Fax: (914) 766-8328
Email: [email protected]

Researchers at many IBM labs around the world are continuously developing powerful algorithms for analyzing large data sets stored in databases or flat files. These algorithms cover a full spectrum of data mining technologies and enable analyses ranging from classification and predictive modeling to association discovery and database segmentation. They can run on small workstations but are highly scaleable since they have parallel implementations optimized to handle large and parallel super computers and databases. Along with a comprehensive set of data analysts and application developers through IBM’s flagship data mining technology product, the Intelligent Miner. Leveraging the Intelligent Miner technology, IBM has developed a collection of applications, Business Discovery Solutions (BDS), to make data mining more accessible to business users. Through an easy to use Java GUI, BDS addresses business problems such as customer retention, risk analysis, store layout optimization, and cross selling. IBM’s top-caliber data mining analysts have extensive industry expertise and have helped more than 60 companies exploit the new developments in data mining. IBM, on whose machines 70% of the world’s data reside, supports all the components required to guarantee a customer’s success in data mining.

Information Discovery, Inc.

Contact: Pamela Lerwick
703B Pier Avenue, Suite 169
Hermosa Beach, CA 90254
Tel: (310) 937-3600
Fax: (310) 937-0967
Email: [email protected]
Web: www.datamining.com

The Data Mining Suite™ is a comprehensive and integrated set of data mining products that provide complete solutions for knowledge discovery, predictive modeling and internet/intranet-based applications within a unified framework. Each product is novel and useful in its own right, but the joint application of the techniques used within the suite delivers unprecedented benefits to corporate users. The Data Mining Suite™ is highly scaleable and accesses large SQL databases directly without sampling or extracts.

Kluwer Academic Publishers

Contact: Adam Chesler
PO Box 358, Accord Station
Hingham, MA 02018-0358
Tel: (617) 871-6600
Fax: (617) 871-6528
Email: [email protected]
Web: www.wkap.nl

Kluwer Publishers will have the journal Data Mining and Knowledge Discovery (Editors-in-chief: Usama Fayyad, Heikki Mannila, and Gregory Piatetsky-Shapiro) on display! Many other fine journals are also available for review, as well as over 25 new books, discounted 20% for KDD ‘97 attendees!

MathSoft, Inc.

Contact: Tina Styer
Data Analysis Products Division
1700 Westlake Ave. N. #500
Seattle, WA 98109
Tel: 800-569-0123
Email: [email protected]
Web: www.mathsoft.com

MathSoft, Inc., is a leading provider of knowledge discovery software, with more than a million users worldwide. MathSoft has a flexible family of products, offering solutions for data analysis, statistical data mining and decision support. S-Plus is the premier solution for powerful data analysis, visualization and statistical data mining. S-Plus offers the richest data analysis environment available, with over 2,000 built-in functions including both classical and robust techniques, all within a customizable, intuitive user interface. In addition, exclusive TRELLIS graphics allow you to reveal hidden meanings in complex, multidimensional data. MathSoft StatServer is a powerful new approach that delivers a competitive edge in decision support. The first enterprise-wide solution for distributing sophisticated analyses and graphics, StatServer leverages your company’s existing client/server and Internet/intranet technology to put information in the hands of decision makers. Stop by to see a demonstration of the power of S-Plus and StatServer.

Morgan Kaufmann Publishers

Contact: Patricia Kim
340 Pine Street, 6th Floor
San Francisco, CA 94104-3205
Tel: (650) 392-2665
Fax: (650) 982-2665
Email: [email protected]
Web: www.mkp.com

Since 1984, Morgan Kaufmann has published the finest technical information resources for computer and engineering professionals. Our audience includes the research and development communities, information technology (IS/IT) managers, and students in professional degree programs. We publish in book and digital form in such areas as databases, computer networking, computer systems, human computer interaction, computer graphics, multimedia information and systems, artificial intelligence, and software engineering. Many of our books are considered to be the definitive works in their fields. Please stop by our display and receive a 15% Knowledge Discovery and Data Mining 1997 conference discount.

PC AI Magazine

Contact: Robin Okun
P.O. Box 30130
Phoenix, AZ 85046
Tel: (602) 971-1869
Fax: (602) 971-2321
Email: [email protected]
Web: www.pcai.com/pcai/

PC AI Magazine provides the information necessary to help managers, programmers, executives, and other professionals understand the quickly unfolding realm of artificial intelligence (AI) and intelligent applications (IA). PC AI addresses the entire range of personal computers including the Mac, IBM PC, NeXT, Apollo, and more. PC AI features developments in expert systems, neural networks, object oriented development, and all other areas of artificial intelligence. Feature articles, product reviews, real-world application stories, and a Buyer’s Guide present a wide range of topics in each issue.

Salford Systems

Contact: Kerry Martin
8880 Rio San Diego Dr.
Suite 1045
San Diego, CA 92108
Tel: (619) 543-8880

Salford Systems is exhibiting CART 3.0 (Classification and Regression Trees), a tree-structured data-mining tool, co-developed with the original authors of CART at UC Berkeley and Stanford (Breiman, Friedman, Olshen, and Stone). The premier decision tree tool, complete with built-in n-fold cross-validation, user-definable variable misclassification costs, linear combination splits, efficient handling of high-dimensional categorical predictors, and the new feature of combining multiple trees, is now available in one affordable package. CART 3.0 has a completely revised Windows graphical user interface and the ability to interact with the tree after database analysis. Useful diagnostics include dynamic pruning, gains charts for all sub-trees, and simultaneous viewing of training and test data accuracy scores. Experienced users can control CART 3.0 via command scripts, while newcomers can use point-and-click menu selections. Throughout an interactive session using the menu, CART 3.0 records command equivalents, providing an audit trail for the session. An in-depth, comprehensive manual explains every feature and nuance of CART within the context of over 30 examples. Salford Systems has been developing advanced tools for data analysis for PC and Unix platforms since 1983, and also provides consulting services to the telecommunications, financial services, health care and direct mail industries.

Silicon Graphics

Contact: Ron Kohavi
Mailstop 80-876
2011 N. Shoreline Blvd.
Mountain View, CA 94043
Tel: (650) 933-3126
Fax: (650) 932-2874
Web: www.sgi.com

MineSet™ version 2.0 is the fourth release of SGI’s product for exploratory data analysis. Combining powerful integrated, interactive tools for data access and transformation, data mining, and visual data mining, MineSet provides you with a revolutionary paradigm for getting maximum value from your vast data resources. MineSet enables you to gain a deeper, intuitive understanding of your data, by helping you to discover hidden patterns, important trends and new knowledge. It is this deep understanding which can be used for developing powerful business strategies leading to greater competitive advantage.

SRA International

Contact: Jim Hayden
4300 Fair Lakes Court
Fairfax, VA 22033
Tel: (703) 803-1689
Fax: (703) 803-1793
Email: [email protected]
Web: www.SRA.com

SRA International has been creating innovative solutions to practical problems faced by businesses and government agencies for over eighteen years. We specialize in the fields of intelligent information retrieval; machine learning; knowledge-based systems; database engineering; and natural language processing. SRA empowers organizations with the ability to discover and detect patterns critical to their success through the use of a complete line of scaleable data mining tools and professional services. SRA’s KDD Toolset includes multi-strategy algorithms for discovering associations, classifications, sequences, and clusters, as well as high-speed rule and sequence-based pattern matching algorithms. These algorithms employ direct database access for mining data. Additionally, they are parallelized to take advantage of multiprocessor platforms for rapid analysis of extremely large data sets. Finally, we employ a comprehensive set of JDBC-compliant Java-based user interfaces for configuration and execution of algorithms as well as visualization of results for analysis and interpretation. SRA’s knowledge discovery specialists understand how best to apply these advanced capabilities to enable you to utilize your most strategic asset: electronic information.

DB2, Parallel Sysplex, Sun, HP, NCR, Digital and Intel and supports Oracle Parallel Server, Informix XPS, and IBM DB2/PE. Torrent Systems is headquartered in Cambridge, MA.

Toshiba Corporation

Contact: Hiroshi Tsukimoto
70 Yanagi-cho, Saiwai-ku
Kawasaki
Japan
Tel: 81-44-548-5469
Fax: 81-44-520-5856
Email: [email protected]

NNE is a neural network based data mining tool. NEX (Neural network EXplainer) is the explanation module of NNE. NEX provides the explanation for trained neural networks by extracting rules from the networks.

WizSoft Ltd.

Contact: Abraham Meidan
3 Beit Hillel St.
Tel Aviv
Israel, 67017
Tel: 972-2-5631948
Fax: 972-3-5611945
Email: [email protected]
Web: www.wizsoft.com

WizWhy for WINDOWS 95/WINDOWS NT is a data mining application for issuing prediction and classification. WizWhy analyzes the data, reveals all the if-then rules and mathematical formula rules, and calculates the significance level of each rule. WizWhy then predicts future cases based on the discovered rules. In empirical tests WizWhy was found to be faster and more accurate than neural
Since trained networks, decision trees and genetic al- Together, SRA’s KDD Toolset and pro- neural networks are black boxes and are fessional services provide solutions that gorithms. difficult for humans to understand, the WizRule for WINDOWS 95/WINDOWS are ideal for tackling such large-scale neural network is incomplete as a tech- problems as fraud detection and pre- NT is a data cleansing and auditing nique of data mining, which aims to tool. WizRule reveals all if-then rules vention, competitive intelligence, and discover understandable knowledge market behavior. and formula rules in the database. It from databases. So NEX makes the neu- then points at the deviations from the Torrent Systems Inc. ral network a complete data mining set of all the discovered rules as suspect- technique. NEX has the following fea- ed errors, and calculates the level of un- Contact: Sondra Barrison tures. (1) NEX can be applied to any likelihood of each deviation. WizRule 5 Cambridge Center neural network including recurrent avoids false alarms; almost every devia- Cambridge, MA 02138 neural networks. (2) NEX can be ap- tion with a high level of unlikelihood is Tel: (617) 354-8484 plied to any training method. (3) NEX indeed an error. Fax: (617) 354-6767 can be applied not only to discrete val- Email: [email protected] ues but also to continuous values. (4) NEX extracts accurate and simple rules Torrent Systems, Inc. develops and in a short time. Especially, when classes markets tools, component software and are continuous, there is no other sys- applications to support system integra- tematic method which can discover un- tors and applications developers in derstandable knowledge. NEX discovers building advanced data warehousing understandable knowledge, that is, and data mining applications that run rules, from trained neural networks. on parallel processing systems. OR- The accuracies of the rules are based on CHESTRATE, Torrentis parallel devel- the trained networks. 
NEX can be con- opment environment, hides the com- nected to any neural network. So any plexity of parallel programming and neural network user can obtain an ex- facilitates the creation of fully parallel, planation for trained neural networks high-performance data-processing so- just by connecting NEX to their neural lutions for your MPP or SMP systems. networks. ORCHESTRATE is fully compatible with all major scalable servers including the IBM RS/6000, SP server using IBM

KDD-97 Demos

An Interactive Visualization Environment for Data Exploration
Paper Title: An Interactive Visualization Environment for Data Exploration
Development Team: Mark Derthick, John Kolojejchick, and Steven F. Roth, Carnegie Mellon University
Telephone: 412-268-8812

We will demonstrate an information-centric interface architecture for unifying the subtasks of knowledge discovery, allowing the analyst to focus on the process rather than the tools. These subtasks include data cleaning, creating a dataset, data reduction and projection, and exploratory visualization. Architectural integration is achieved with a shared object-oriented database, which is accessible to the user via a visual query language. Tight integration between queries and visualizations makes exploration more interactive and less of a feed-forward process than in previous systems.

What Is Unique about the System? Tight integration of querying and visualization. Interactive visualizations involving attributes of multiple objects.

Document Explorer
Paper Titles: (1) Visualization Techniques to Explore Data Mining Results for Document Collections, and (2) Maximal Association Rules: A New Tool for Mining for Keyword Co-Occurrences in Document Collections
Development Team: Ronen Feldman and Amir Zilberstein, Bar-Ilan University; Willi Kloesgen, GMD
Telephone: 972-3-5318629 or 972-3-9326702

Document Explorer is a data mining system for document collections. Such a collection represents an application domain, and the primary goal of the system is to derive patterns that provide knowledge about this domain. Additionally, the derived patterns can be used to browse the collection. Document Explorer searches for patterns that capture relations between concepts of the domain. The patterns which have been verified as interesting are structured and presented in a visual user interface allowing the user to operate on the results to refine and redirect mining queries or to access the associated documents. The system offers preprocessing tools to construct or refine a knowledge base of domain concepts and to create an intermediate representation of the document collection that will be used by all subsequent data mining operations. The main pattern types the system can search for are frequent sets, associations, concept distributions, and keyword graphs. To provide some explicit bias, the system provides a dedicated query language for searching the vast implicit spaces of pattern instances that exist in the collection. This query language offers syntactical, background, quality and redundancy constraints. The query language is embedded in a GUI which makes it easy even for novice users to explore the document collections.

What Is Unique about the System? (1) Dealing with unstructured information; (2) unique visualization tools; (3) special query language designed for text mining; and (4) unique browsers that enable interactive exploration.

GeoMiner: A Geo-Spatial Data Mining Engine
Paper Title: Described in Proceedings of SIGMOD'97
Development Team: Jiawei Han, Krzysztof Koperski, Nebojsa Stefanovic, and Qing Chen, Simon Fraser University Intelligent Database Systems Research Lab
Telephone: (604) 291-4411

Spatial data mining aims to mine high-level spatial information and knowledge from large spatial databases. A spatial data mining system prototype, GeoMiner, has been designed and developed based on our years of experience in the research and development of the relational data mining system DBMiner, and our research into spatial data mining. The data mining power of GeoMiner includes five spatial data mining modules: characterization, comparison, association, clustering, and classification. The SAND (Spatial And Nonspatial Data) architecture is applied in the modeling of spatial databases, whereas GeoMiner includes the spatial data cube construction module, spatial on-line analytical processing (OLAP) module, and spatial data mining modules. A spatial data mining language, GMQL (Geo-Mining Query Language), is designed and implemented as an extension to Spatial SQL, for spatial data mining. Moreover, an interactive, user-friendly data mining interface is constructed and tools are implemented for visualization of discovered spatial knowledge.

What Is Unique about the System? A spatial data mining system performing knowledge discovery based on both spatial and non-spatial properties of objects. It also includes spatial OLAP modules and tools for visualization of discovered spatial knowledge.

 KDD‒ Demos

Id-Vis
Paper Title: A Visual Interactive Framework for Attribute Discretization
Development Team: Ramesh Subramonian, Ramana Venkata, and Joyce Chen, Microcomputer Research Laboratory, Intel Corporation
Telephone: (408) 653-6794 (Ramana Venkata)

We will demonstrate the Discretizer module of Id-Vis, our interactive platform for visual data mining on a client-server architecture. This module features multiple available algorithms, a drag-and-drop cut-point editor, multiple levels of data visualization with drill-down capability, and more. A number of the variables are exposed to user experimentation. The system provides visual cues to the "optimal" number and locations of the cut-points. It also provides feedback to the user about the extra "badness," over the system-derived optima, incurred during the experimentation.

The central philosophy is that the system should place the user within the appropriate context of system-derived values, and provide the user with the opportunity to intelligently modify these optima (e.g., by displaying the impact of the modifications on accuracy) and propagate them downstream. The server initially ships a compacted version of the raw data, or an equi-probabilistically distilled density function, to the client. If, during the drill-down process, the user's information request and the associated accuracy constraints cannot be satisfied by the locally available data, the client obtains the appropriate data from the server and displays it.

What Is Unique about the System? Visual interactivity; opportunity for the user to encode his or her intuition/domain knowledge into and during the mining process; a feedback-loop paradigm of data mining as a learning process.

Interactive Knowledge Exploration Using DBMiner
Paper Title: Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes
Development Team: Jiawei Han, Jenny Y. Chiang, Sonny Chee, Shan Cheng, Wan Gong, Micheline Kamber, Kris Koperski, Yijun Lu, Nebojsa Stefanovic, Lara Winstone, Betty Xia, Osmar R. Zaiane, and Hua Zhum, Simon Fraser University
Telephone: (604) 291-4411

With years of research and development effort, the DBMiner system, developed in CS/SFU, Canada, has incorporated many advanced research results. These include mining multiple-level knowledge in large relational databases and data warehouses; a wide spectrum of data mining functions, including characterization, comparison, association, classification, prediction, and clustering; and a data visualization package. The major technologies adopted are integration with data warehouse and OLAP technology, attribute-oriented induction, statistical analysis, progressive deepening for mining multiple-level knowledge, and meta-rule guided mining. The system provides a user-friendly, interactive data mining environment with good performance.

What Is Unique about the System? Integration with OLAP, multiple-level mining and multiple data mining modules.

Kensington - High-Performance Distributed and Parallel Data Mining
Paper Title: Large Scale Data Mining: Challenges and Responses
Development Team: J. Chattratichat, J. Darlington, M. Ghanem, Y. Guo, M. Kohler, A. Saleem, J. Sutiwaraphun, and D. Yang, Department of Computing, Imperial College
Telephone: +44-171-5948360

Kensington is a prototype of an open, distributed, web-based, high-performance data mining system for use on parallel servers. A web-based client tool, written in Java, gives access to a distributed collection of databases and data mining modules which are executed on a high-performance parallel machine. The application consists of components which are integrated using Java/Corba middleware. The main components are: a database server, a data mining server, a visualisation and Web-server, and the client control tool. The data mining servers are embedded in Corba objects and distributed across a LAN or WAN. The database server is accessed via JDBC. The client tool can access and control the data mining actions from anywhere. The high-performance mining modules are portable parallel implementations in C and MPI, and cover commonly needed functions such as classification (C4.5), prediction (neural networks), association discovery and self-organizing maps. The light-weight client tool is enriched by visualization applets for data mining results, whenever they become available. The overall goal is to (a) integrate distributed servers, e.g. data and computation servers, and (b) make the integrated system universally accessible over the Web.

What Is Unique about the System? The Kensington architecture combines a flexible integration approach, based on distributed object technology, with high-performance data mining components. The component technology is platform-independent, and allows straightforward extension of the system. Accessibility over the Web is an inherent part of all components. In summary, the key features of Kensington are: accessibility, extensibility, distributed object architecture, platform-independence, high performance components.

Mining For Many Kinds of Knowledge
Paper Title: Knowledge = Concepts: A Harmful Equation
Development Team: Arun Sanjeev and Jan Zytkow, Wichita State University
Telephone: 316-978-3015 (Arun Sanjeev) or 316-978-3925 (Jan Zytkow)

We will demonstrate Forty-Niner (49er), an automated discovery system that discovers knowledge in databases. We will apply the system to several databases to demonstrate how different forms of knowledge can be automatically discovered. 49er searches for many types of knowledge. It starts from contingency tables (CTs) and then recognizes special types of CTs which lead to other, more specialized forms of knowledge, such as equations, equivalence relations, or subset relations. When many relations of the same type have been discovered, 49er combines them into forms such as taxonomies and subset graphs. We will contrast 49er with specialized systems that are focused on a single form of knowledge, even if other forms of knowledge are much more appropriate for a given dataset. Further, we will contrast 49er's focus on knowledge that contains as much empirical content as possible with the focus on concept definitions typical of machine learning. We will show how CTs can be used to generate decision trees and how 49er's search for additional "redundant" knowledge makes decision trees more flexible and statistically significant. We will also contrast 49er's approach to taxonomy formation (combine many approximate equivalence relations) with statistical and conceptual clustering.

What Is Unique about the System? 49er is an autonomous knowledge discoverer. It automatically tunes itself to the forms of knowledge that are appropriate for a given dataset: equations, contingency tables, taxonomies, decision trees, and the like. 49er explores huge hypothesis spaces, evaluating the strength (to ensure predictive power) and significance of results (to prevent overfitting).

SONAR (System for Optimized Numeric Association Rules)
Paper Title: Computing Optimized Rectilinear Regions for Association Rules
Development Team: Takeshi Fukuda, Yasuhiko Morimoto, Hirofumi Matsuzawa, Shinichi Morishita, Takeshi Tokuyama, and Kunikazu Yoda, IBM Tokyo Research Laboratory
Telephone: +81-462-73-4946 (Kunikazu Yoda)

Recent progress in technologies for data input has made it easier for finance and retail organizations to collect massive amounts of data and to store them on disk at a low cost. Such organizations are interested in extracting from these huge databases previously unnoticed information that inspires new marketing strategies. In this demonstration, we introduce a system for mining optimized association rules and for generating decision/regression trees from databases with numeric data as well as categorical data.

What Is Unique about the System? Our system uses novel algorithms for efficiently creating ranges and regions with respect to various optimization criteria such as maximization of confidence or support, and minimization of entropy and mean squared error.

S-PLUS DataBlade for Informix Universal Server
Paper Title: Data Mining with Trellis Graphics
Development Team: Kevin Brown and Jun Luo, Informix; Vikram Chalana, Scott Blachowicz, Marianna Clark and Doug Martin, Mathsoft
Telephone: (206) 283-8802 x229 (Doug Martin)

Informix Universal Server (IUS) is an object-relational database. S-PLUS is an object-oriented language and system for data analysis, statistical modeling, visualization and programming with data. An IUS DataBlade is a collection of types (classes) of objects and access methods that closely integrates applications software with the IUS database, typically on the server. The S-PLUS DataBlade for IUS provides new data types in IUS corresponding to intrinsic S-PLUS datatypes and functions to apply any S-PLUS expression on these datatypes. It also provides functions to convert IUS native data to these S-PLUS datatypes and vice versa. The demonstration will include several data mining application examples, including: integrated query and statistical data mining, visualizing and modeling the relationship between equity returns, firm size and book-to-market; robust beta mining (finding firms listed on the AMEX, NASDAQ and NYSE for which the beta calculation is influenced by outliers, and visualizing the data for such firms); hexagonal binning visualization of scatterplots for largish data sets; application of trellis graphics to a Lucent customer value analysis (CVA) study.

What Is Unique about the System? The object-oriented aspects of IUS and S-PLUS make the S-PLUS DataBlade a natural marriage of the two technologies: arbitrary object types can have mirror images in the IUS database. Furthermore, the use of the IUS SQL93 is smoothly integrated with the use of S-PLUS functions.
