StatisticStatistic MethodsMethods inin DataData MiningMining

Business Data Understanding Understanding

Data Preparation

Deployment

Modelling

Evaluation

DataData MiningMining ProcessProcess

ProfessorProfessor Dr.Dr. GholamrezaGholamreza NakhaeizadehNakhaeizadeh Short review of the last lecture

IntroductionIntroduction ƒƒLiteratureLiterature used used ƒƒWhyWhy Data ? Mining? ƒƒExamplesExamples of of lar largege databases databases ƒƒWhatWhat is is Data Data Mining? Mining? ƒƒInterdisciplinaryInterdisciplinary aspects aspects of of Data Data Mining Mining ƒ Other issues in recent data analysis: ƒ Other issues in recent data analysis: ExamplesExamples of of applications applications WebWeb Mining, Mining, Text Text Mining Mining ƒ Optimal structure of a Data Mining Team ƒ Typical Data Mining Systems ƒ Optimal structure of a Data Mining Team ƒ Typical Data Mining Systems Success factors of DM-Applications ƒ Examples of Data Mining Tools ƒƒSuccess factors of DM-Applications ƒ Examples of Data Mining Tools ƒ Predictive Modeling ƒƒComparComparisonison of of Data Data Mining Mining Tools Tools ƒ Predictive Modeling ƒ Data Mining in Business and Banking ƒƒHistoryHistory of of Data Data Mining, Mining, Data Data Mining: Mining: ƒ Data Mining in Business and Banking DataData Mining Mining r apidrapid development development ƒƒDataData Mining Mining in in Quality Quality Management Management ƒƒSomeSome European European funded funded projects projects ƒƒScientificScientific Networking Networking and and par partnershiptnership ƒƒConferencesConferences and and Journals Journals on on Data Data Mining Mining ƒƒFurFurtherther References References

2 Data Mining Process

CRISP-DM :

- Provides an overview of the life cycle of a data mining project

- Consists of six phases Business Data Understanding Understanding - was partially funded by the European Commission Data Preparation Project Partner: Deployment

Modelling

Evaluation

- CRISP-DM Process Model is described in: 3 http://www.crisp-dm.org/CRISPwP-0800.pdf Data Mining Process

CRICRISP-DM:SP-DM: Business Business Understanding Understanding

http://www.crisp-dm.org/CRISPwP-0800.pdf ••DetermineDetermine businessbusiness objectivesobjectives

••AssessAssess situationsituation

••DetermineDetermine datadata miningmining goalsgoals

••ProduceProduce projectproject planplan

4 Data Mining Process

CRISP-DM: Data Understanding CRISP-DM: Data Understanding GeneralGeneral aspects aspects

••CollectCollect initialinitial datadata

••DescribeDescribe datadata

••ExploreExplore datadata

••VerifyVerify datadata qualityquality

5 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding CollectingCollecting initial initial data data

CanCan the the data data be be accessed accessed effectively effectively and and efficiently efficiently ? ? --HowHow big big is is the the needed needed storage storage ? ? --HowHow long long does does it it take take to to access access the the data data ? ? ••IsIs there there any any restriction restriction in in collecting collecting the the data data ? ? --privacyprivacy issues, issues, --tootoo expensive expensive data, data, --tootoo expensive expensive collecting collecting process,.. process,.. ••……………………

6 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding CollectingCollecting initial initial data data whatwhat are are the the needed needed data data ? ? where where are are the the data data ? ? ExamplesExamples of of data data sources sources

UCI KDD Database Repository for large datasets used and knowledge discovery research. UCI KDD Database Repository for large datasets used machine learning and knowledge discovery research. UCI Machine Learning Repository. UCI Machine Learning Repository. Delve, Data for Evaluating Learning in Valid Experiments Delve, Data for Evaluating Learning in Valid Experiments FEDSTATS, a comprehensive source of US statistics and more FEDSTATS, a comprehensive source of US statistics and more FIMI repository for frequent itemset mining, implementations and datasets. FIMI repository for frequent itemset mining, implementations and datasets. Financial Data Finder at OSU, a large catalog of financial data sets Financial Data Finder at OSU, a large catalog of financial data sets GeneSifter Data Center, access to microarray datasets through the GeneSifter microarray data analysis system. GeneSifter Data Center, access to microarray datasets through the GeneSifter microarray data analysis system. GEO (GEO Gene Expression Omnibus), a gene expression/molecular abundance repository supporting MIAME GEO (GEO Gene Expression Omnibus), a gene expression/molecular abundance repository supporting MIAME compliant data submissions, and a curated, online resource for gene expression data browsing, query and retrieval. compliant data submissions, and a curated, online resource for gene expression data browsing, query and retrieval. Grain Market Research, financial data including stocks, futures, etc. Grain Market Research, financial data including stocks, futures, etc. Investor Links, includes financial data Investor Links, includes financial data Microsoft's TerraServer, aerial photographs and satellite images you can view and purchase. Microsoft's TerraServer, aerial photographs and satellite images you can view and purchase. MIT Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research. MIT Cancer Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research. National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America. web sites, including countries from Africa, Europe, Asia, and Latin America. National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more. physics, life sciences, astrophysics, and more. PubGene(TM) Gene Database and Tools, genomic-related publications database PubGene(TM) Gene Database and Tools, genomic-related publications database SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments. SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments. SourceForge.net Research Data, includes historic and status statistics on approximately 100,000 projects and SourceForge.net Research Data, includes historic and status statistics on approximately 100,000 projects and over 1 million registered users' activities at the project management web site. over 1 million registered users' activities at the project management web site. STATOO Datasets part 1 and part 2 STATOO Datasets part 1 and part 2 UCR Time Series Data Mining Archive, offering datasets, papers, links, and code. UCR Time Series Data Mining Archive, offering datasets, papers, links, and code. United States Census Bureau. United States Census Bureau. 7 Source: http://www.kdnuggets.com/datasets/ Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding CollectingCollecting initial initial data data whatwhat are are the the n neededeeded data data ? ? ••wherewhere are are the the data data ? ? - Flat Files - Flat Files DB1 - Databases - Databases DataData Preprocessing: Preprocessing: - Heterogeneous Databases - Heterogeneous Databases • Cleaning - Connected autonomous databases • Cleaning - Connected autonomous databases DB2 - Legacy Databases • Integration Data - Legacy Databases • Integration warehouse • •TransformaTransformationtion inheritedinherited from from languages, languages, platforms, platforms, …….……. andand techni techniqquueses earlier earlier than than cu currentrrent technologytechnology -- DataData warehouse warehouse DBm

8 Data Warehouse (DWH)

IntroductionIntroduction DevDeveelopmentlopment of of DWH DWH started started in in the the beginning beginning of of 80s 80s DWHDWH is is an an enterprise- enterprise-widewide database database thatthat serves serves as as a a databsedatabse for for all all kind kind of of management management support support systems systems

Definition:Definition: SeveralSeveral definit definitionion can can be be found found for for DW DW in in the the literat literature.ure. OneOne often often used used is is due due to to W. W. H. H. Inmon: Inmon:

„„AA Data Data Wareh Warehouseouse is is a a subject subject-or-oriented,iented, integrated, integrated, timtime-variante-variant and and non non-volatile-volatile collection collection of of Data Data in in support support ofof managements managements Decision Decision support support process. process.””

TechnicalTechnical potent potentialial benef benefitsits ••IntegratedIntegrated database database systems systems f oforr management management support support ••DischargeDischarge operat operationalional data data pr processingocessing systems systems ••QuickQuick queries queries and and reports reports due due to to the the integrated integrated data data 9 Data Warehouse

DefinitionDefinition (continuous) (continuous)

99SubjectSubject-Or-Orientienteed:d: OrientedOriented to to main main subj subjectsects like like Customer, Customer, Company, Company, product, product, supplier,.. supplier,.. insteadinstead to to concentrate concentrate on on company company's's ongoing ongoing operations. operations.

99IntegratIntegrated:ed: IntegrateIntegrate data data from from different different heterogeneous heterogeneous data data sources sources RelatioRelationnalal databases databases f lfatlat f ifles….iles…. byby application application of of data data cleaning cleaning and and data data integration integration methods methods consistency consistency in in naming, naming, encodingencoding struct structureure and and attr attributibuteses measures measures is is fulf fulfilledilled

99Time-variantTime-variant : :Analysis Analysis on on temporal temporal changes changes and and dev deveelopmentslopments requiresrequires the the long-term long-term storage storage of of dat dataa in in DW; DW; theref thereforeore “t “time”ime” isis a a main main dimension dimension of of DW DW

99NonvNonvolatile:olatile: The The data data once once stored stored in in a a DW DW should should not not change change ; ; otherwiseotherwise it it is is not not possible possible to to perform perform a a realistic realistic data data analysis analysis

10 Data Warehouse

ArchitectureArchitecture DataData Marts Marts

StagiStaginngg areaarea

Operating OLAP System Sales Tools Extraction Tools

Mining Data Loading Data Purchases Operating Extraction Tools Tools Transformation Tools Warehouse System Data Cleaning

Extraction Reporting Tools Tools Customers Flat files

ETL:ETL: Extraction, Extraction, Transformation, Transformation, Loading Loading 11 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DescribingDescribing data data

SomeSome of of data data characterization characterization measures measures ••numbenumberr of of observations observations ••numbernumber of of attributes attributes ••numbernumber of of classes classes

••numbernumber of of observations observations per per class class (bal (balancedanced an andd uunnbbalalanancceedd cl classes)asses) ••……………………

DataData Character Characterizingizing Tool, Tool, DCT, DCT, was was dev developedeloped at at DaimlerChrysler DaimlerChryslerDDataata Mining Mining ResearchResearch Depar Departmenttment in in cooperation cooperation with with the the Universities Universities of of Karlsruhe Karlsruhe and and Leeds Leeds

12 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DescribingDescribing data data

••OtherOther measures measures to to characterize characterize data data

Initial Statistics Initial Statistics Example

CountCount 10001000 MeanMean 1.4071.407 MinMin 11 MaxMax 44 RangeRange 33 VarianceVariance 0.3340.334 StandardStandard Deviation Deviation 0.5780.578 StandardStandard Error Error of of Mean Mean 0.0180.018

13 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DescribingDescribing data data

••OtherOther measures measures to to characterize characterize data data

SkewnessSkewness IsIs a a measure measure that that determines determines the the degree degree of of asymmetryasymmetry of of a a distribution distribution

KurtosisKurtosis IsIs a a measure measure that that determines determines the the degree degree of of peakednesspeakednessoor rflatness flatness of of a a distribution distribution compared compared withwith normal normal distribution. distribution.

14 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DescribingDescribing data data

SkewnessSkewnessandand Kurtosis Kurtosis

http://www.csun.edu/~ata20315/psy524/docs/Psych%20524%20lecture%203%20DS.pdf 15 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DescribingDescribing data data DatasetDataset Structure Structure

ƒƒObservationsObservations

•• AA dataset dataset can can be be cons consideredidered as as a a collect collectionion of of ob observationsservations

••OtherOther names names for for observation: observation: case, case, data data object, object, entity entity, ,event, event, instance, instance, p pattern,attern, point, point, record, record, sample,..sample,..

ƒƒAAttributesttributes Attributes 1 2 3 4 5 •EachEach observation observation is is described described by by one one or or several several attributes attributes • 1 • The attributes of an observation essentially define the 2 • The attributes of an observation essentially define the 3 proppropertieserties o of fth thatat o obbservatioservationn Observations 4 5 •OtherOther names names for for attributes: attributes: feature, feature, field, field, variable, variable, .. .. • 6 7 8 16 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DescribingDescribing data data DatasetDataset Structure Structure

Example for a dataset: Annual Income Attributes

Income in Education Age Income three years ago 1 24552 High School 32 27026 2 88282 BSc 52 93725 3 82902 PhD 41 82356 4 39838 High School 56 36828 Observations 5 53542 PhD 32 62542 6 63826 MS 28 64882 7 82783 MA 43 89025 8 72886 High School 33 74925 9 21383 BA 37 62572 10 63552 BA 41 66427 11 62522 High School 25 63552 12 65254 PhD 56 67252 17 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DescribingDescribing data data DatasetDataset Structure Structure

Example for representation of Document Data Attributes

Observations

Source: Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Pearson Addison wesley (May, 2005). Hardcover: 769 pages. ISBN: 0321321367 18 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DescribingDescribing data data DatasetDataset Structure Structure

AttributeAttribute Type: Type: Attribute Attribute type type is is characterized characterized by by type type of of the the values values usedused to to measure measure it it

LevelLevel of of Measurement: Measurement: nominal, nominal, ordinal, ordinal, interval, interval, ratio ratio

{nominal,{nominal, ordinal} ordinal} Æ Æcategoricalcategorical , ,qualitative qualitative {interval,{interval, ratio} ratio} Æ Æcontinuous-valuedcontinuous-valued , ,quantitative quantitative

TheThe value value of of a a nominal-scaled nominal-scaledattributeattribute does does not not have have per per se se any any evaluativeevaluative distinction. distinction. It It is is just just enough enough to to distinguish distinguish one one observationobservation from from another: another: A=B, A=B, or or A A = = B B Example:Example: race, race, birthplace, birthplace, religious, religious, ID ID 19 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DescribingDescribing data data Dataset Structure Dataset Structure AttributeAttribute type type

TheThe value value of of a a ordinal-scaled ordinal-scaledvariablevariable represents represents its its rank rank order. order. It It is is enough enough toto distinguish distinguish one one observation observation from from another: another: A=B, A=B, or or A A B B and anditsits rank: rank: A>B A>B oror A

ExampleExample (1): (1): Mineral Mineral Hardness Hardness Hardness Mineral Absolute Hardness

1 Talc (Mg3Si4O10(OH)2) 1

2 Gypsum (CaSO4·2H2O) 2

3 Calcite (CaCO3) 9

4 Fluorite (CaF2) 21

5 Apatite (Ca5(PO4)3(OH-,Cl-,F-) 48

6 Orthoclase Feldspar (KAlSi3O8) 72

7 Quartz (SiO2) 100

8 Topaz (Al2SiO4(OH-,F-)2) 200

9 Corundum (Al2O3) 400 10 Diamond (C) 1500

Source: http://en.wikipedia.org/wiki/Mohs_scale_of_mineral_hardness 20 AttributeAttribute type type Example 2: Ranking of German Soccer Teams (Bundesliga)

Rank Club

1th Bayern München 2nd Hamburger SV 3rd Bayer Leverkusen 4th Werder Bremen 5th FC Schalke 04 6th VfB Stuttgart 7th Eintracht Frankfurt 8th VfL Wolfsburg 9th Karlsruher SC 10th Hannover 96

21 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DescribingDescribing data data

DatasetDataset Structure Structure AttributeAttribute type type

IntervalInterval Attribute: Attribute: ••HHaveave all all the the features features of of ordinal ordinal attributes attributes ••InIn addition addition equal equal differences differences between between measurements measurements cancan be be viewed viewed as as equivalent equivalent intervals. intervals. ••DifferencesDifferencesbetweenbetween arbitrary arbitrary pairs pairs of of measurements measurements can can be be meaningfullymeaningfully compared compared ItIt is is m meaningful:eaningful: A=B, A=B, A>B A>B (A

Examples:Examples: ••TemperaturTemperatur in in Celsius Celsius or or Fahrenheit Fahrenheit (Equal (Equal differences differences represent represent equalequal differences differences in in temperature, temperature, but but 40 40 degrees degrees is is not not twice twice as as warmwarm as as 20 20 degrees). degrees). ••ZeroZero temperature temperature does does not not mean mean no no temperature temperature 22 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DescribingDescribing data data

AttributeAttribute type type

RatioRatio Attribute: Attribute: ••HHaveave all all the the features features of of interval interval attributes attributes ••InIn addition addition ratios ratiosareare meaningful meaningful absoultabsoult zero zero exists exists

Examples:Examples: ••Age,Age, income income , ,sales sales volume volume ••ZeroZero Age Age is is meaningful: meaningful: absence absence of of age age or or birth. birth. ••AA 60-year 60-year old old person person is is twice twice as as old old as as a a 30-year 30-year old old one one ••ZeroZero income income means means no no income income

23 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DescribingDescribing data data

AttributeAttribute type type

T em in co Meanigful are: Mi pe m n ra e er tu a re l H Multiplication, devision (*, /), (-), (> , < )(= ≠ ) a rd n e ss Difference (-), (> , < ), (= ≠ )

c Greater, les (> , < ), (= ≠ ) ol or Equality, inequality (= ≠ )

Source: http://www.socialresearchmethods.net/kb/measlevl.php

24 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DescribingDescribing data data

AttributeAttribute type type : : another another classification classification

•• DiscreteDiscrete Attributes Attributes –– HHaveave a a finite finite or or countable countable infinite infinite set set of of values values –– EExamples:xamples: number number of of children children , ,counts counts –– OOftenften represented represented as as integer integer variables variables –– SSpecialpecial case case of of discrete discrete attributes attributes : :binary binary attributesattributes

•• ContinuousContinuous Attributes Attributes –– HHaveave real real numbers numbers as as attribute attribute values values –– EExamples:xamples: Income, Income, sales sales , ,weight weight

25 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding

DataData Type Type

••DataData Streams Streams - Infinite volumes ••Cross-SectionCross-Section data data - Infinite volumes --DynamicallyDynamically Changing Changing --RealReal time time processing processing ••TimeTime Series Series data data • Spatial data • Panel data • Spatial data • Panel data ••SpatiotemporalSpatiotemporal data data ••TransactionTransaction data data ••SequencesSequences - Postman Routes - Postman Routes ••TextText data data - Web Click Streams - Web Click Streams ••webweb data data ••MultimediaMultimedia data data

26 Data Mining Process

Data Type CRISP-DM:CRISP-DM: Data Data Understanding Understanding Data Type

Example for cross-section data: Annual Income

Income in Education Age Income three years ago 1 24552 High School 32 27026 Example for time-series data: Siemens share 2 88282 BSc 52 93725 3 82902 PhD 41 82356 4 39838 High School 56 36828 5 53542 PhD 32 62542 6 63826 MS 28 64882 7 82783 MA 43 89025 8 72886 High School 33 74925 9 21383 BA 37 62572 10 63552 BA 41 66427 11 62522 High School 25 63552 12 65254 PhD 56 67252

27 Data Mining Process

Data Type CRISP-DM:CRISP-DM: Data Data Understanding Understanding Data Type

Example for the source of panel-data

A Representative Longitudinal Study of Private Households in the Entire Federal Republic of Germany

••TheThe SO SOEPEP is is a a wide-ranging wide-ranging representative representative l ongitudinallongitudinal study study of of private private households. households.

••ItIt provides provides information information on on all all household household members, members, consisting consisting of ofGermansGermans living living in in the the Old Old and and NewNew German German States, States, Foreigners, Foreigners, and and recent recent Immigrants Immigrants to to Germany. Germany.

••TheThe Panel Panel was was started started in in 1984. 1984. In In 2006, 2006, there there were were n nearlyearly 11,000 11,000households,households, and and moremore than than 20,000 20,000 persons persons sampled. sampled.

••SomeSome of of the the many many topics topics include include household household composition, composition, occupational occupational biographies, biographies, employment,employment, earnings, earnings, health health and and satisfaction satisfaction indicators. indicators.

••TheThe data data are are available available to to researchers researchers in in Germ Germanyany and and abroad abroad in in SPSS, SPSS, SAS, SAS, Stata, Stata, and and ASCI ASCII I formformatat for for im immedimediateate use. use. Ext Extensiveensive documentation documentation in in English English and and G Geermrmanan is is avai availablelable onli onlinne.e.

Source: http://www.diw.de/deutsch/soep/29012.html 28 Data Mining Process

Data Type CRISP-DM:CRISP-DM: Data Data Understanding Understanding Data Type SpatialSpatial Data Data

••knownknown also also as as geospatial geospatial data data or or geographic geographicinformationinformation

••describesdescribes the the geographic geographic location location of of features features and and boundaries boundaries on on Earth Earth

••usuallyusually stored stored as as coordinates coordinates and and topology topology

••cancan be be mapped mapped represented represented as as 2D 2D or or 3D 3D images images

••cancan be be often often accessed accessed or or analyzed analyzed through through GIS GIS (Geographic(Geographic Information Information systems) systems)

29 Data Mining Process

Data Type CRISP-DM:CRISP-DM: Data Data Understanding Understanding Data Type

ExampleExample for for Spatial Spatial Data: Data: US US Temperature Temperature Map Map

30 Source: http://www.wunderground.com/US/Region/US/Temperature.html Letzter Stand 05:00 AM GMT am 28. März 2008 Data Mining Process

Data Type CRISP-DM:CRISP-DM: Data Data Understanding Understanding Data Type SpatiotemporalSpatiotemporal Data Data

••SpatiotemporalSpatiotemporal data data describes describes the the development development andand changes changes of of Spatial Spatial data data over over the the time time

Examples:Examples: GPS-Data,GPS-Data, SatalliteSatallite images images TrafficTraffic Data Data TelecommunicationTelecommunication Data Data ….….

31 Data Mining Process

Data Type CRISP-DM:CRISP-DM: Data Data Understanding Understanding Data Type

Example for the source of spatial data EDC : EROS Data Center USGS : U.S.Geological Survey Earth Explorer Geospatial Data One-Stop Seamless Data Distribution Center"> Geodata Explorer Publications and Data Products National Mapping Information Cartographic Data Products, Information, and Services Geologic Data Data Standards Water Resources Data FGDC : Federal Geographic Data Committee U.S. GeoData FTP file access - DEM, DLG, LULC Manual of Federal Geographic Data Product CENSUS BUREAU SDTS : Spatial Data Transfer Standard TIGER Database NGDC : National Geospatial Data Clearinghouse 2000 U.S. Census Data Popular Digital Geospatial Data Set Collections 1990 U.S Census Data Digital Geospatial Data Set by Theme 1980 Census Data (SEEDIS) GLIS : Global Land Infomation System Data Maps 1:100,000-Scale Digital Line Graphs TIGER Map Services 1:200,000-Scale Digital Line Graphs Census State Data Centers 30 Arc-Sec. DCW Digital Elevation Models NOAA : National Oceanic and Atmospheric Administration 5 Minute Gridded Earth Topography Data NOAA Data Set Catalog Conterminous U.S. AVHRR National Geophysical Data Center (NGDC) MultiSpectral Scanner Landsat Data World Data Center System Space Shuttle Earth Observation Program National Climatic Data Center (NCDC) Thematic Mapper Landsat Data National Hurricane Center USGS Land Use and Land Cover Data National Oceanographic Data Center (NODC) Environmental Research Laboratories 32 http://ncl.sbs.ohio-state.edu/5_sdata.html Example of Web Data: A log file sample

Source: http://eprints.rclis.org/archive/00004887/01/kx05-poster_mayr.pdf 33 Example of Web Data: A log file sample

fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 ([email protected])" fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400] "GET /news/news.html HTTP/1.0" 200 16716 "-" "FAST-WebCrawler/2.1-pre2 ([email protected])" ppp931.on.bellglobal.com - - [26/Apr/2000:00:16:12 -0400] "GET /download/windows/asctab31.zip HTTP/1.0" 200 1540096 "http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html" "Mozilla/4.7 [en]C-SYMPA (Win95; U)"

123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"

123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

123.123.123.123 - - [26/Apr/2000:00:23:50 -0400] "GET /pics/5star.gif HTTP/1.0" 200 1031 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /pics/a2hlogo.jpg HTTP/1.0" 200 4282 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /cgi-bin/newcount?jafsof3&width=4&font=digital&noshow HTTP/1.0" 200 36 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

Source: http://www.jafsoft.com/searchengines/log_sample.html

34 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DataData exploration exploration

DataData explorationexploration DataData explorationexploration ToolsTools MayMay be be useful useful ••UsingUsing descript descriptiveive dat dataa summarizat summarizationion ƒƒtoto get get the the first first insights insightsintointo the the structure structure ••UsingUsing Visualiz Visualizationation ofof data data ƒƒtoto identify identify noisy noisy data data or or outliers outliers

Source: http://www.math.yorku.ca/SCS/Gallery/ 35 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DataData exploration exploration

ToolsTools f oforr descript descriptiveive data data summar summarizatizationion

ƒƒMeasuresMeasures of of Location Location (Central (Central Tendency): Tendency):

summarizesummarize an an attribute attribute by by a a "typical" "typical" value value commoncommon measures: measures: mean, mean, median median , ,mode mode

ƒƒMeasuresMeasures of of Spread Spread (Dispersion): (Dispersion):

summarizesummarize how how much much the the observations observations of of an an attribute attribute differdiffer from from each each other other commoncommon measures measures of of spread: spread: range, range, variance, variance, averageaverage absolute absolute deviation deviation

36 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DataData exploration exploration Measures of Location:

Median (Middel Number):

(The observations should be arranged in ascending order ) MeanMean (Average): (Average): n odd Æ X = X n Med (n + 1)/ 2 X = 1/n Xi ∑ n even Æ X = 1/ 2 ( X + X ) i=1 Med n / 2 n / 2 +1

ModeMode (Modal (Modal Number) Number) : : TheThe most most frequently frequently occurring occurring attribute attribute value value

Warning:Warning: If If there there is is in in observation observation an an outlier, outlier, the the mean meanunderstatesunderstates (overstates)(overstates) the the true true value. value. In In this this case case the the median median is is a a better better measuremeasure 37 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DataData exploration exploration Measures of Spread

UnbiasedUnbiased Sample Sample Variance: Variance:

2 S =

Same mean different variance

StandardStandard Deviation: Deviation: AverageAverage Absolute Absolute Deviation Deviation isis the the positive positive square square root root of of the the variance variance n AA = 1/n Xi-m(X) AA = 1/n ∑ Xi-m(X) Range:Range: i= 1 m(x):m(x): Mean, Mean, Median Median or or Mode Mode RR = = X Xmaxmax -X-Xminmin 38 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding DataData exploration exploration

OLAP:OLAP: O Onlnlineine A Analyticalnalytical P Processingrocessing

39 Source of the cube fig. in this and the following pages: http://training.inet.com/OLAP/Cubes.htm OLAP

OLAP: Online Analytical Processing OLAP: Online Analytical Processing CanCan be be considered considered asas a a pre-Analy pre-Analysissis forfor Data Data Mining Mining

Data stored in databases User can gain insight into OLAPOLAP User can gain insight into multidimensionalmultidimensional data data by by a a SoftwareSoftware varietyvariety of of possible possible views views Data Stored in flat files

isis often often a a combination combination ofof data data expl explorationoration and and tools Online: visualization tools is often integrated in Online: is often integrated in NoNo programming programming databasedatabase sy systemsstems isis needed needed FurthFurtherer d deevelovelopmenpment to of f explorativeexplorative an analyalyssisis o of f multidimenmultidimensionsionalal d daatata 40 OLAP

OLAP-CUBE:OLAP-CUBE: AnalysisAnalysis in in OLAP OLAP is is done done by by using using O OLAP-CUBESLAP-CUBES

CUBECUBE Measure: Measure: content content of of a a cel cell lcan can be be

a Number ( number of cell phones produced in ••a Number ( number of cell phones produced in EuroEuroppee in in 2000) 2000) an amount (total sales in $ of cell phones produced in ••an amount (total sales in $ of cell phones produced in EuroEuroppee in in 2000 2000 ) ) ••SometimesSometimes called called “target “target qu quanantitytity””

CubeCube Dimensions: Dimensions:

••ComparableComparable with with attri attributesbutes in in Data Data Mining Mining ••DimensionsDimensions have have nominal nominal values values (called (called categories) categories) ••DimensionDimension with with continuou continuouss categories categories have have to to be be convertedconverted to to nominal nominal categories categories ••InIn th thee reality reality, ,th thee number number of of Dimensions Dimensions is is often often moremore th thanan 3 3 (Hy (Hyppercubercubee)) 41 Source of the cube fig. in this and the following pages: http://training.inet.com/OLAP/Cubes.htm OLAP

Slicing:Slicing: Selecting Selecting a a value value of of a a dimensional dimensional and and consider consider allall the the cells cells belong belong to to other other dimensions dimensions

Slice SliceSlice Asia Asia Slice ConsistConsist of of 16 16 cells cells WirelessWireless Mouse Mouse andand 16 16 measures measures 42 Source of the cube fig. in this and the following pages: http://training.inet.com/OLAP/Cubes.htm OLAP

Dicing:Dicing: selecting selecting a a subset subset of of a a cube cube on on two two or or more more dimensions dimensions

DiceDice op operationeration in invovolvinglving 3 3 Dimensions: Dimensions: (Location:(Location: Asia Asia, ,Africa), Africa), (Product:(Product: Modems Modems, ,Cell Cell phones) phones) andand (Tim (Time:e: 2000, 2000, 2001) 2001)

43 Source of the cube fig. in this and the following pages: http://training.inet.com/OLAP/Cubes.htm OLAP

MoreMore about about Dimensions: Dimensions: Each Each category category of of a a Dimension Dimension may may have have subcategories subcategories

44 Source of the cube fig. in this and the following pages: http://training.inet.com/OLAP/Cubes.htm OLAP

Rotating (Pivoting): Rotating the axes in order to generate an alternative presentation of the data

45 Source of the cube fig. in this and the following pages: http://training.inet.com/OLAP/Cubes.htm OLAP

Roll-upRoll-up : :Aggre Aggregationgation by by c climbinglimbing up up a a categ categoryory hier hierarchyarchy

Te hr Ma an sh 1750 ha Is d tan 250 Ira bu Tu n A l rke 2000 nk 850 y ara Roll-UpRoll-Up on on location: location: Q1 1000 Q1 cities to countries 150 cities to countries Q2

own on location: Q3 Q2 DDrirlill-ld-down on location: countries to cities countries to cities Q4 Q3 TV

Q4 TV

Drill-downDrill-down : :Going Going to to more more detailed detailed da datata by by stepping stepping down down a a categ categoryory hier hierarchyarchy

46 Source of the cube : http://training.inet.com/OLAP/Cubes.htm OLAP

OtherOther capabilities capabilities and and functionalities functionalities

¾¾CalculationCalculation Engine Engine for for ••RatiosRatios ••MeanMean ••VarianVariancece ••…..…..

¾¾SupportingSupporting functional functional modelingmodeling for: for: ••ForecastForecastinging ••TrendTrend analysis analysis ••OthOtheerr statistical statistical computations computations andand tests tests

47 OLAP

OtherOther systems systems

¾¾ROLAP:ROLAP: Relational Relational OLAP OLAP • OLAP software based on relational ¾¾HOLAP:HOLAP: Hy Hybridbrid OLAP OLAP • OLAP software based on relational • Such systems combine ROLAP and datadata bases bases • Such systems combine ROLAP and • They have greater scalability MOLAPMOLAP technologies technologies • They have greater scalability • They benefit from the high scalability thanthan MOLA MOLAPP but but less less efficiency efficiency • They benefit from the high scalability ofof ROLAP ROLAP sy systemsstems and and faster faster compucomputatiotationn o of fMOLAP MOLAP sy systemsstems

¾¾MOLAP:MOLAP: Multidimen Multidimensionsionalal OLAP OLAP • OLAP software based on multidimensional • OLAP software based on multidimensional ¾ OLAM: Online Analytical Mining datadata m modelsodels ¾ OLAM: Online Analytical Mining • Mapping multidimensional views directly ••IntegrationIntegration of of OLA OLAPP with with • Mapping multidimensional views directly Data Mining toto d dataata cub cubee array array structures structures Data Mining ••RelatedRelated to to th thee co concepncept t “in-database“in-database Mining” Mining”

48 Data Mining Process

CRISP-DM:CRISP-DM: Data Data Understanding Understanding VerifyingVerifying data data quality quality

TheThe real real world world data data are are often often “dirty”, “dirty”, data data “Cleaning” “Cleaning”isis needed needed

••AreAre data data accurate accurate ? ? --noisynoisy data data

••AreAre data data complete complete ? ? --missingmissing values values

••AreAre data data consistent consistent ? ? --CodingCoding Errors Errors

49 Short review of business and data understanding

ƒƒCollectCollect initial initial data data --CanCan the the data data be be accessed accessed effect effectivelyively and and eff efficientlyiciently ? ? --IsIs there there any any restr restrictioniction in in collecting collecting t hthee data data ? ? --whatwhat ar aree the the needed needed data data ? ? wh whereere are are the the data data ? ? --ExamplesExamples of of data data sources sources --DataData w warehousearehouse

ƒƒDescribeDescribe data data --SomeSome of of data data characterizat characterizationion measures measures --DataData Structure Structure Observation,Observation, attribute attribute ty typepe (nomin (nominal,al, ordinal, ordinal, interval, interval, ratio, ratio, qua qualitative,litative, quantitative, quantitative, discrete) discrete) DataData T Type:ype: Cross-section Cross-section d data,ata, time time series series data, data, panel panel data, data, spatial spatial data… data…

ƒƒExploreExplore data data --DataData exploration exploration Tools Tools UsingUsing descriptive descriptive data data sum summarizationmarization (mean, (mean, median, median, modus, modus, variance,…) variance,…) --UsingUsing Visualiz Visualizationation --OLAPOLAP

ƒƒVerifyVerify data data quality quality --AreAre dat dataa accurate accurate ? ? --AreAre dat dataa complete complete ? ? --AreAre data data consistent consistent ? ?