CRISP-DM: Data Understanding

Statistic Methods in Data Mining
Data Mining Process
Professor Dr. Gholamreza Nakhaeizadeh

[Figure: CRISP-DM cycle with the phases Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]

Short review of the last lecture

Introduction
• Literature used
• Why Data Mining?
• Examples of large databases
• What is Data Mining?
• Interdisciplinary aspects of Data Mining
• Examples of applications
• Typical Data Mining Systems
• Examples of Data Mining Tools
• Comparison of Data Mining Tools
• History of Data Mining, Data Mining: rapid development
• Some European funded projects
• Scientific Networking and partnership
• Conferences and Journals on Data Mining
• Further References

Other issues in recent data analysis:
• Web Mining, Text Mining
• Optimal structure of a Data Mining Team
• Success factors of DM-Applications
• Predictive Modeling
• Data Mining in Business and Banking
• Data Mining in Quality Management

Data Mining Process

CRISP-DM:
- Provides an overview of the life cycle of a data mining project
- Consists of six phases: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment
- Was partially funded by the European Commission
- Project Partner: [Figure: project partner logos]
- The CRISP-DM Process Model is described in: http://www.crisp-dm.org/CRISPwP-0800.pdf

CRISP-DM: Business Understanding
http://www.crisp-dm.org/CRISPwP-0800.pdf
• Determine business objectives
• Assess situation
• Determine data mining goals
• Produce project plan

CRISP-DM: Data Understanding
General aspects
• Collect initial data
• Describe data
• Explore data
• Verify data quality

CRISP-DM: Data Understanding
Collecting initial data
• Can the data be accessed effectively and efficiently?
  - How big is the needed storage?
  - How long does it take to access the data?
• Is there any restriction in collecting the data?
  - privacy issues,
  - too expensive data,
  - too expensive collecting process, ...
• ...

CRISP-DM: Data Understanding
Collecting initial data
• What are the needed data?
• Where are the data?

Examples of data sources
• UCI KDD Database Repository for large datasets used in machine learning and knowledge discovery research.
• UCI Machine Learning Repository.
• Delve, Data for Evaluating Learning in Valid Experiments
• FEDSTATS, a comprehensive source of US statistics and more
• FIMI repository for frequent itemset mining, implementations and datasets.
• Financial Data Finder at OSU, a large catalog of financial data sets
• GeneSifter Data Center, access to microarray datasets through the GeneSifter microarray data analysis system.
• GEO (Gene Expression Omnibus), a gene expression/molecular abundance repository supporting MIAME compliant data submissions, and a curated, online resource for gene expression data browsing, query and retrieval.
• Grain Market Research, financial data including stocks, futures, etc.
• Investor Links, includes financial data
• Microsoft's TerraServer, aerial photographs and satellite images you can view and purchase.
• MIT Cancer Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research.
• National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.
• National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.
• PubGene(TM) Gene Database and Tools, genomic-related publications database
• SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments.
• SourceForge.net Research Data, includes historic and status statistics on approximately 100,000 projects and over 1 million registered users' activities at the project management web site.
• STATOO Datasets part 1 and part 2
• UCR Time Series Data Mining Archive, offering datasets, papers, links, and code.
• United States Census Bureau.
Source: http://www.kdnuggets.com/datasets/

CRISP-DM: Data Understanding
Collecting initial data
• What are the needed data?
• Where are the data?
  - Flat Files
  - Databases
  - Heterogeneous Databases
  - Connected autonomous databases
  - Legacy Databases: inherited from languages, platforms, and techniques earlier than current technology
  - Data warehouse

[Figure: source databases DB1, DB2, ..., DBm pass through Data Preprocessing (Cleaning, Integration, Transformation) into the Data warehouse]

Data Warehouse (DWH)
Introduction
• Development of DWH started at the beginning of the 1980s.
• A DWH is an enterprise-wide database that serves as a database for all kinds of management support systems.

Definition:
Several definitions of a DWH can be found in the literature. One often used is due to W. H. Inmon:
"A Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision support process."

Technical potential benefits
• Integrated database systems for management support
• Relieves operational data processing systems
• Quick queries and reports due to the integrated data

Data Warehouse
Definition (continued)
• Subject-Oriented: oriented to main subjects like customer, company, product, supplier, ... instead of concentrating on company
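
Example (illustrative sketch, not part of the original slides): the preprocessing steps named above (cleaning, integration, transformation) and the idea of a subject-oriented store can be tried out on a toy scale. The sketch below assumes two hypothetical source extracts, here called db1_customers and db2_orders. All field names and values are invented for illustration, and a real warehouse load would be considerably more involved.

```python
# Hypothetical extracts from two operational sources (names and values invented).
db1_customers = [
    {"cust_id": "17", "name": " Alice Meyer ", "country": "de"},
    {"cust_id": "18", "name": "Bob Kurz", "country": "DE"},
    {"cust_id": "18", "name": "Bob Kurz", "country": "DE"},  # duplicate record
]
db2_orders = [
    {"customer": 17, "amount_eur": "120.50"},
    {"customer": 17, "amount_eur": "30"},
    {"customer": 18, "amount_eur": None},     # missing value
    {"customer": 99, "amount_eur": "15.00"},  # no matching customer in DB1
]


def clean_customers(rows):
    """Cleaning: unify key type, trim names, normalize country codes, drop duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        cust_id = int(row["cust_id"])
        if cust_id in seen:
            continue
        seen.add(cust_id)
        cleaned.append({
            "cust_id": cust_id,
            "name": row["name"].strip(),
            "country": row["country"].upper(),
        })
    return cleaned


def integrate(customers, orders):
    """Integration and transformation: build one record per customer (the subject)."""
    warehouse = {
        c["cust_id"]: {**c, "n_orders": 0, "total_eur": 0.0} for c in customers
    }
    unmatched = []
    for order in orders:
        target = warehouse.get(order["customer"])
        if target is None:
            unmatched.append(order)  # keep for later inspection, do not guess
            continue
        if order["amount_eur"] is None:
            continue  # orders without an amount are not counted in this sketch
        target["n_orders"] += 1
        target["total_eur"] += float(order["amount_eur"])
    return warehouse, unmatched


warehouse, unmatched = integrate(clean_customers(db1_customers), db2_orders)
for record in warehouse.values():
    print(record)
print("unmatched orders:", unmatched)
```

The result is organized around the customer rather than around the source systems, which is the sense of "subject-oriented" in the definition above. Orders that cannot be matched are set aside instead of being dropped silently, so they can be checked when data quality is verified.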
