Phenomenal Data Mining and Link Analysis
From: AAAI Technical Report FS-98-01. Compilation copyright © 1998, AAAI (www.aaai.org). All rights reserved.

Donal Lyons
School of Systems and Data Studies, University of Dublin, Trinity College, Dublin, Ireland.
[email protected]

Gregory S. Tseytin
Intelligent Systems Laboratory, V. I. Smirnov Research Institute, St. Petersburg State University, Russia.
tseytin@tseytin.spb.ru

Abstract

Targeted marketing is an increasing trend in advertising, with companies attempting to send heavily customised mail shots to only that subset of customers identified as likely to be interested. This requires detailed demographic data on the target group and, when the customer is identified by virtue of having paid by credit card or ordered by mail, such data may be readily available. When customers are anonymous as, for example, at a supermarket checkout, Phenomenal Data Mining has been proposed as a methodology for making demographic and lifestyle inferences from analysis of captured point-of-sale data.

Planners of Public Transport systems require an understanding of patterns of commuter behaviour. This is traditionally derived by expensive survey or diary methods. We show that there are underlying similarities between EPOS and pre-paid ticket data captured by the on-bus Wayfarer system used by the Dublin Bus company, and investigate Phenomenal Data Mining as a complementary low-cost methodology for analysing this data. We expect that it will be possible to make statistically valid identifications of commuter trips.

Introduction

Phenomenal Data Mining (PDM) has been proposed, but not proven, as a methodology for making demographic and lifestyle inferences from analysis of captured point-of-sale (EPOS) data. In this paper, we investigate its utility for understanding patterns of commuter behaviour in data from an on-bus operational system. The dynamics of commuter behaviour is currently an active area of research; for a good overview, see (Mahmassani 1997) and in particular his discussion of day-to-day dynamics, pp 292-295.

An attractive aspect of PDM is that it identifies individual commuters. When attempting to carry this out, we found Database Visualisation methods were necessary. To help in understanding the patterns of their behaviour, we see Link Analysis as being needed. These techniques are explored later in this paper.

Typically in Data Mining projects a considerable proportion of the overall project time (up to 70-80%) has to be devoted to the preliminary phases. In a later section we look at how this has applied to the present project.

Phenomenal Data Mining

Data Mining is concerned with the non-trivial extraction of previously unknown and potentially useful information from databases that may be large, noisy and have missing data (Piatetsky-Shapiro & Frawley 1991). Contributions have come from both the Machine Learning community (who, as a broad generalisation, have concentrated on pattern recognition within large but relatively clean databases) and from the Statistics community (who, as a broad generalisation, have concentrated on understanding and modelling smaller but noisier datasets). Little of this work has focused on the underlying phenomena that give rise to the observed data. Hence, much published data mining work results in aggregate techniques such as the construction of association rules or Bayesian belief networks rather than in a technique such as Link Analysis, which works with individual objects and their attributes.

In contrast, Phenomenal Data Mining attempts to find relations between the data and the phenomena which give rise to that data, rather than just relations among the data. A major element is the attempt to model the underlying phenomena and their attributes. John McCarthy, in his unpublished working paper "Phenomenal Data Mining" (McCarthy 1996), investigates

  ... what can be inferred about phenomena from data and what facts are relevant to doing this. The main technical point is that functions and predicates involving the phenomena should be explicit in the logical sentences and not just present in the mind of the person doing the data mining.

  ...

  Science and common sense both tell us that the facts about the world are not directly observable but can be inferred from observations about the effects of actions. What people infer about the world is not just relations among observations but relations among entities that are much more stable than observations. For example, 3-dimensional objects are more stable than the image on a person's retina, the information directly obtained from feeling an object or on an image scanned into a computer.

  ...

  The extreme positivist philosophical view that science concerns relations among observations still influences the design of learning programs, and that's what data miners are. However, science never worked that way, neither do babies and neither should data mining programs. All obtain and use representations of the objects and use observations only as a means to that end.

He proposes a programme to mine information such as is typically captured in an Electronic Point of Sale (EPOS) system, firstly, to identify and label customers by assigning the observed basket data to an inferred phenomenon (customer) and, secondly, to infer likely attributes (age, sex, family...) of the phenomenon. It seems useful to consider the former task as a (sometimes essential) precursor and the latter as the core PDM paradigm. In completely anonymous data, it can be difficult to identify customers; however, exploratory work at the IBM Almaden Research Laboratory found some suggestive results in a convenience store database:

1. Packages of functionally related items (like toothbrush, toothpaste and mouthwash) grouped into baskets. This example is of particular interest because these items fell into different categories in the taxonomy used by the store chain, which generally is based on classification of customer needs.

2. Many cases of similar baskets bought in succession, probably by friends.

3. A repeated combination of school stationery items which very probably was due to a local school (almost all such baskets were bought in the same store of the chain).

An interesting aspect of this work is that it did discover some repeated shopping patterns, but the underlying phenomena turned out to be other than regular customers. In retrospect, that database can be seen to have been a difficult test of the core PDM paradigm, as it contained many transient customers. It seems reasonable 'a priori' that a programme of using Data Mining techniques to infer demographics from EPOS data can be carried out for many databases, but experiment and algorithm development are needed. However, if successful, methods such as this would be a valuable, relatively low-cost complement to traditional Market Research data.

One of the current authors (Tseytin) was involved in the Almaden work; the other author (Lyons) had previously been involved in a project to develop a system for identifying suitable locations for Ticket Agents for Dublin Bus using Wayfarer system data on pre-paid

that this database would contain some identifiable phenomena (commuters) and the present paper describes the ongoing work of identifying individual commuters and understanding the patterns of their behaviour.

Although the two databases are superficially different, there is an underlying similarity between:

• A supermarket basket where a single person (with a unique identifier of checkout number and timestamp) purchases various items each identified by a product code.

• A 'journey basket' where a single person (identified by ticket number) takes a weekly package of trips which are identified by routes and times.

In both cases, it is the combination of items selected that we use as a basis for our inference.

The Wayfarer System

The Wayfarer system is Dublin Bus's electronic transaction processing system, which is used to collect financial and operational data from its fleet of buses. The system includes:

• Bus-mounted equipment that captures data relating to cash ticket sales, prepaid ticket validations and individual bus journeys.

• Depot-based and centralised equipment onto which the captured data is downloaded.

Currently, the data is used for financial control, hardware diagnostics and, to a lesser extent, marketing and route planning. While there is recognition of the potential for a more intensive analysis of the data, limited investigation of novel applications has been undertaken by Dublin Bus to date. One reason for this is that, when the system was implemented in 1989, disk storage and RAM were at a premium. The suppliers included extended data at a far-sighted request from Dublin Bus, but their standard software only processed a highly compressed subset of this data. Software modification would be required to analyse the ID number that is associated with a pre-paid ticket.

Record Identifier   Record Type                   Main Data
82                  Start of Duty                 Date
83                  Start of Journey              Route Id, Scheduled Start Time, Actual Start Time
84                  End of Journey                Stop Time
85                  Stage Update                  Stage Number and Time
0                   Pre-paid Ticket Validation    ID Number, Ticket Type

Table 1: Record Types and Associated Data

During November of 1997, Dublin Bus began supplying a portion of the data generated by the system to the School of Systems and Data Studies, Trinity College, Dublin (TCD). The data, some 40 MB per week or 2 GB per year, are received on a daily basis and fall into two categories:

• Those that are generated at the time of a prepaid ticket validation and which contain information on an individual ticket. No data relating to cash ticket sales has been