Large Scale Data Mining Using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learning
GECCO 2013 Tutorial

Jaume Bacardit*, University of Nottingham, Nottingham, UK, [email protected]
Xavier Llorà, Google Inc., 1600 Amphitheatre Pkwy, Mountain View, CA 94043, [email protected]
* Presenter

Review paper based on this tutorial: WIREs Data Mining and Knowledge Discovery 2013, 3: 37–61, doi: 10.1002/widm.1078
Slides at http://www.slideshare.net/jaumebp

Copyright is held by the author/owner(s). GECCO'13 Companion, July 6–10, 2013, Amsterdam, The Netherlands. ACM 978-1-4503-1964-5/13/07.

Jaume Bacardit
• PhD in evolutionary learning
• Postdoc in protein structure prediction, 2005-2007
• Since 2008, lecturer in bioinformatics at the School of Computer Science, University of Nottingham
• Research interests:
– Large-scale data mining
– Biodata mining

Machine Learning and Data Mining
• Core of data mining → machine learning: how to construct programs that automatically learn from experience [Mitchell, 1997]
(Diagram: Training Set → Learning Algorithm → Models → Inference Engine; a New Instance goes in, an Annotated Instance comes out)

What Will We Cover?
• What does large scale mean?
• Evolution as massive parallel processing
• The challenges of data mining
• Kaleidoscopic large scale data mining
• Real examples
• Summary and further directions

WHAT DOES LARGE SCALE MEAN?

What Does Large Scale Mean?
• Many scientific disciplines are currently experiencing a massive "data deluge"
• Vast amounts of data are available thanks to initiatives such as the Human Genome Project or the virtual human physiome
• Data mining technologies need to deal with large volumes of data, scale accordingly, extract accurate models, and provide new insight
• So, what does large mean?
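The "Machine Learning and Data Mining" slide above describes a pipeline in which a learning algorithm turns a training set into a model, and an inference engine applies that model to new instances. A minimal sketch of that pipeline (the majority-class "learner" and all names below are illustrative stand-ins, not part of the tutorial):

```python
# Minimal sketch of the slide's pipeline: Training Set -> Learning
# Algorithm -> Model -> Inference Engine. The majority-class learner
# is a deliberately trivial stand-in for a real GBML method.

def learn(training_set):
    """Learning algorithm: build a model from labelled experience."""
    labels = [label for _, label in training_set]
    return {"majority": max(set(labels), key=labels.count)}

def infer(model, instance):
    """Inference engine: annotate a new, unlabelled instance."""
    return model["majority"]

training_set = [((1.0, 2.0), "pos"), ((0.5, 0.1), "neg"), ((0.9, 1.5), "pos")]
model = learn(training_set)
annotated = ((0.7, 0.7), infer(model, (0.7, 0.7)))
```

A real GBML learner would evolve rules or rule sets instead of counting labels, but the separation between training-time learning and inference-time annotation is the same.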
Large Meaning… Piles of Records
• Datasets with a high number of records
– This is probably the most visible dimension of large-scale data mining
– GenBank (the genetic sequences database from the NIH) contained, as of April 2011, more than 135 million gene sequences and more than 126 billion nucleotides
• Think big: Twitter, Facebook?

Large Meaning… Piles of Records
• Datasets with a high number of records
– Not all data comes from the natural sciences
– Netflix Prize:
• Generating better movie recommendation methods from customer ratings
• Training set of 100M ratings from over 480K customers on 18K movies
• Data collected between October 1998 and December 2005
• Competition ran from 2006 to 2009

Large Meaning… High Dimensionality
• High-dimensionality domains
– Sometimes each record is characterized by hundreds, thousands, or even more features
– Microarray technology (like many other post-genomic data generation techniques) can routinely generate records with tens of thousands of variables
– Creating each record is usually very costly, so datasets tend to have a very small number of records. This imbalance between the number of records and the number of variables is yet another challenge
(Reinke, 2006; image licensed under Creative Commons)

Large Meaning… Rare
• Class imbalance
– It is a challenge to generate accurate classification models where not all classes are equally represented
– Contact map prediction datasets (briefly explained later in the tutorial) routinely contain millions of instances, of which less than 2% are positive examples
– Tissue type identification is highly imbalanced (see figure) (Llora, Priya, Bhargava, 2009)

Evolution and Parallelism
• Evolutionary algorithms are parallelism rich
• A population is data rich (individuals)
• Genetic operators are highly parallel operations

EVOLUTION AS MASSIVE PARALLEL PROCESSING
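The "Evolution and Parallelism" slide notes that genetic operators, and fitness evaluation in particular, parallelize naturally over the population. A sketch of that idea (the OneMax fitness and the thread pool are illustrative choices; a real system might use processes, MPI, or GPUs):

```python
# Each individual's fitness depends only on that individual, so the
# whole population can be evaluated concurrently. OneMax (count the
# 1-bits) is an illustrative fitness function, not from the tutorial.
from concurrent.futures import ThreadPoolExecutor

def fitness(individual):
    return sum(individual)  # OneMax: number of 1-bits

population = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1]]

with ThreadPoolExecutor() as pool:
    scores = list(pool.map(fitness, population))  # no inter-individual dependencies

# Selection, by contrast, needs the complete scored population first.
best = population[scores.index(max(scores))]
```

The contrast in the last two steps previews the dependency classes discussed next: evaluation is embarrassingly parallel, while selection is a synchronization point.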
(Diagram: the population, Ind. 1 … Ind. n, replicated across the evaluation, selection, and crossover stages)

Operations and Their Dependencies
• No dependencies → embarrassing parallelism
– Fitness evaluation
– Each individual can be evaluated simultaneously
• Weak dependencies → synchronization points
– Crossover
– Once the parents are available, the operator can be applied
• Strong dependencies → careful inspection (bottlenecks)
– Selection
– The complete population needs to be available
– The wrong implementation can introduce large serial execution chunks

Other Perks
• Evaluation can be costly
• Some evolutionary models
– Mimic natural evolution by introducing spatial relations (remember Darwin's islands?)
– Are modeled after decentralized systems (cellular-automata-like)
• Based on the nature of evolutionary algorithms and the above ingredients, multiple parallelization models have been proposed (Cantu-Paz, 2000; Alba, 2005)

But?
• What about the data?

THE CHALLENGES OF DATA MINING

The Challenges of Data Mining
• We have seen in the previous slides how evolutionary algorithms have a natural tendency for parallel processing, hence being suitable for large-scale data mining
• However, data mining presents a challenge that goes beyond pure optimization: evaluation is based on data, not just on a fitness formula

The Challenges of Data Mining
• Holding the data is the first bottleneck that large-scale data mining needs to face
– Efficiently parsing the data
– Proper data structures to achieve the minimum memory footprint
• It may sound like just a matter of programming, but it can make a difference
• Especially important when using specialized hardware (e.g. CUDA)
– Optimized, publicly available data-handling libraries exist (e.g. the HDF5 library)
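The point about proper data structures and minimum memory footprint can be illustrated with the Python standard library alone: storing numeric attribute values as boxed Python objects costs several times more memory than a packed typed array (NumPy arrays and HDF5 datasets give similarly packed layouts). Exact byte counts are platform-dependent; only the ratio matters.

```python
# Generic object storage vs. a packed typed array for 100k attribute
# values. sys.getsizeof(list) counts only the pointer array, so the
# per-object cost of each boxed float must be added separately.
import sys
from array import array

n = 100_000
values = [float(i) for i in range(n)]   # boxed floats + pointers
packed = array("d", values)             # contiguous 8-byte doubles

list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
packed_bytes = sys.getsizeof(packed)

print(f"boxed: {list_bytes} bytes, packed: {packed_bytes} bytes")
```

On a typical CPython build the boxed representation is roughly four times larger, which is exactly the kind of difference that decides whether a large training set fits in memory at all.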
The Challenges of Data Mining
• Usually it is not possible to hold all the training data in memory
– Partition it and use different subsets of data at a time
• Windowing mechanisms; we will talk about them later
• Efficient strategies for using CUDA technology
– Hold different parts of the data in different machines
• Parallel processing; we will also talk about this later
• Can data richness become a benefit rather than a problem?
– Data-intensive computing

The Challenges of Data Mining
• Preprocessing
– A lot of work goes into getting high-quality data
– Getting the representation right
– Both require that the data miners and the end users understand each other
• Classic challenges of machine learning
– Over-learning
• Our models need to have good predictive capacity
– Generating interpretable solutions
• Discovering useful new knowledge inside the data

KALEIDOSCOPIC LARGE SCALE DATA MINING

Large Scale Data Mining Using GBML
• Efficiency enhancement techniques
• Hardware acceleration techniques
• Parallelization models
• Data-intensive computing

Prelude: Efficiency Enhancement
• Review of methods and techniques explicitly designed for data mining purposes
• Evolutionary computation efficiency enhancement techniques could also be applied (and we show some examples of this too)

Efficiency Enhancement Techniques
• Goal: modify the data mining methods to improve their efficiency without special/parallel hardware
• Remember:
– An individual can be a rule, a rule set, or a decision tree…
– Individuals' parameters need to be estimated (accuracy, generality…)
• Included in this category are:
– Windowing mechanisms
– Exploiting regularities in the data
• For a good tutorial on efficiency enhancement methods, see the GECCO 2005 tutorial on
efficiency enhancement by Kumara Sastry:
http://www.slideshare.net/kknsastry/principled-efficiency-enhancement-techniques
Also included in this category:
– Fitness surrogates
– Hybrid methods

Windowing Mechanisms
• Classic machine learning concept
– Do we need to use all the training data all the time?
– Using a subset would result in faster evaluations
– How do we select this subset, and how often is it changed?
– How accurate will the fitness estimation be? Will it favor modularity?
• Freitas (2002) proposed a classification of these methods into three types:
– Individual-wise: changing the subset of data for each evaluated solution
– Generation-wise: changing the subset of data at each generation of the evolutionary algorithm
– Run-wise: selecting a single subset of data for a whole run of a GA

Windowing Mechanisms - ILAS
• Incremental Learning with Alternating Strata (Bacardit, 2004)
• Generation-wise windowing mechanism
• The training set is divided into non-overlapping strata
• Each GA iteration uses a different stratum, following a round-robin policy (evaluation speeds up linearly with the number of strata)
(Diagram: training set split at 0, Ex/n, 2·Ex/n, 3·Ex/n, …, Ex; successive iterations cycle through the strata)
• This mechanism also introduces some extra generalization pressure, since good solutions need to survive multiple strata

Windowing Mechanisms - ILAS
• How far can we increase the number of strata?
• Problem with ~260K instances and 150 strata
• Knowledge learnt on different strata does not integrate

Exploiting Regularities
• The instances in the training set do not usually cover the search space uniformly
• Instead, there are recurrent patterns and regularities that can be exploited for efficiency purposes
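The ILAS round-robin scheme described above can be sketched as follows (function names are illustrative; a production version would also stratify by class so that each stratum preserves the training set's class distribution):

```python
# ILAS-style generation-wise windowing: split the training set into
# non-overlapping strata and let iteration t evaluate on stratum
# t mod n, so each fitness evaluation touches only 1/n of the data.

def make_strata(training_set, n_strata):
    """Partition into n non-overlapping strata (round-robin split)."""
    return [training_set[i::n_strata] for i in range(n_strata)]

def stratum_for_iteration(strata, iteration):
    """Round-robin policy: iteration t uses stratum t mod n."""
    return strata[iteration % len(strata)]

training_set = list(range(10))  # stand-in for 10 training instances
strata = make_strata(training_set, n_strata=3)

first = stratum_for_iteration(strata, 0)   # instances 0, 3, 6, 9
fifth = stratum_for_iteration(strata, 4)   # 4 mod 3 = 1 -> instances 1, 4, 7
```

Because every individual must eventually score well on all strata, not just one, the rotation itself supplies the extra generalization pressure mentioned on the slide.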