Convergence of HPC, Big Data and Machine Learning: a Science and Engineering Perspective
Professor Tony Hey
Chief Data Scientist
Rutherford Appleton Laboratory, Science and Technology Facilities Council (STFC)
[email protected]

The Fourth Paradigm: Data-Intensive Science

Thousand years ago – Experimental Science
• Description of natural phenomena
Last few hundred years – Theoretical Science
• Newton’s Laws, Maxwell’s Equations…
$$\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$$
Last few decades – Computational Science
• Simulation of complex phenomena
Today – Data-Intensive Science
• Scientists overwhelmed with data sets from many different sources
• Data captured by instruments
• Data generated by simulations
• Data generated by sensor networks

eScience is the set of tools and technologies to support data federation and collaboration
• For analysis and data mining
• For data visualization and exploration
• For scholarly communication and dissemination

With thanks to Jim Gray

Particle Physics and Astronomy

Higgs Boson Machine Learning Challenge (Kaggle)
Use the ATLAS experiment to identify the Higgs boson
$13,000 · 1,785 teams · 4 years ago
[Figure: ATLAS Experiment event display. Run: 204153, Event: 35369265, 2012-05-30 20:31:28 UTC]

DeepJet: Jet classification with the CMS experiment
Markus Stoye, Imperial College London, Data Science Institute (DSI)
"Big data science in astroparticle physics", HAP workshop, Aachen, Germany, 20th Feb. 2018

Particle and vertex based DNN: DeepJet
[Figure: DeepJet network architecture, with charged-particle inputs and layer sizes 700, 400 and 250]
• 700 inputs and 250,000 model parameters
• The particle and vertex based DNN has a factor of 10 fewer free parameters than a generic dense DNN would have
• 100M jets used for training; overtraining is not an issue

Impact of DNN architecture
[Figure: performance curves plotted against b-jet efficiency (0 to 1) on a logarithmic vertical axis. Legend: Blue: generic DNN (650 inputs); Red: physics inspired DNN (650 inputs); a third legend entry is illegible in the source]
Physics object based DNN performs best

Kaggle competition (CERN)
$25,000 Prize Money · 656 teams · 22 days ago
To explore what our universe is made of, scientists at CERN are colliding protons, essentially recreating mini big bangs, and meticulously observing these collisions with intricate silicon detectors. While orchestrating the collisions and observations is already a massive scientific accomplishment, analyzing the enormous amounts of data produced from the experiments is becoming an overwhelming challenge.

Machine Learning in Astronomy

Machine learning and the Dark Energy Survey (DES)
- Classification: galaxy type (0908.2033), star/galaxy (1306.5236), Supernovae Ia (1603.00882)
- Photo-z (1507.00490)
- Mass of the Local Group (1606.02694)
- Search for Planet 9 in DES
Slides thanks to Ofer Lahav

Photometric Classification of Supernovae
[Figure: classification pipeline. Data → feature extraction (SALT2, parametric models, wavelets) → classifiers (NB, KNN, SVM, ANN, BDT) → classified SNe; an accompanying plot has a false positive rate axis]
Lochner, McEwen, Peiris, Lahav, Winter, arXiv:1603.00882

PLAsTiCC Astronomical Classification (Kaggle)
$25,000 Prize Money
Can you help make sense of the Universe?
LSST Project · 34 teams · 3 months to go (2 months to go until merger deadline)
Help some of the world's leading astronomers grasp the deepest properties of the universe.
The Photometric LSST Astronomical Time-Series Classification Challenge (PLAsTiCC) asks Kagglers to help prepare to classify the data from this new survey. Competitors will classify astronomical sources that vary with time into different classes, scaling from a small training set to a very large test set of the type the LSST will discover.
Acknowledgements: The National Science Foundation (NSF) is an independent federal agency created by Congress in 1950 to promote the progress of science. NSF supports basic research and people to create knowledge that transforms the future.

Square Kilometre Array (SKA)
Exploring the Universe with the world's largest radio telescope

SKA Project
The Square Kilometre Array (SKA) project is an international effort to build the world's largest radio telescope, with eventually over a square kilometre (one million square metres) of collecting area. The scale of the SKA represents a huge leap forward in both engineering and research & development towards building and delivering a unique instrument, with the detailed design and preparation now well under way.
As one of the largest scientific endeavours in history, the SKA will bring together a wealth of the world's finest scientists, engineers and policy makers to bring the project to fruition.
[Figure: Artist impression of the Square Kilometre Array]

Data Flow through the SKA
[Figure: data flow diagram. Labels, in source order: ~50 PFLOPS; 8.8 Tb/s; ~5 Tb/s; SKA1-MID; 7.2 Tb/s; ~2 Pb/s; 100 PFLOPS; SKA1-LOW; 130 - 300 PB/yr; Users]

Big Scientific Data from Large Experimental Facilities in the UK

Rutherford Appleton Laboratory
• ISIS (Spallation Neutron Source)
• Central Laser Facility
• LHC Tier 1 computing
• JASMIN Super-Data-Cluster
• Diamond Light Source

Cryo-SXT Data
Segmentation of Cryo-Soft X-ray Tomography (Cryo-SXT) data
The University of Nottingham · Computer Vision Laboratory · Diamond B24 beamline · Data Analysis Software Group
[Figure: single slice of a neuronal-like mammalian cell line, with the nucleus and cytoplasm labelled]

Data
● B24: Cryo Transmission X-ray Microscopy beamline at DLS
● Data Collection: Tilt series from ±65° with 0.5° step size
● Reconstructed volumes up to 1000x1000x600 voxels
● Voxel resolution: ~40nm currently
● Total depth: up to 10μm
● GOAL: Study structure and morphological changes of whole cells

3D Volume Data Segmentation
Challenges:
● Noisy data, missing wedge artifacts, missing boundaries
● Tens to hundreds of organelles per dataset
● Tedious to manually annotate
● Cell types can look different
● Few previous annotations available
● Automated techniques usually fail
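
The acquisition figures quoted on the Cryo-SXT data slide can be turned into rough per-dataset numbers. The short Python sketch below is only a back-of-envelope illustration: it assumes that both end angles of the tilt series are included and that each voxel is stored as a 32-bit float (4 bytes). Neither assumption is stated in the slides, and the constant and function names are made up for this example.

```python
# Back-of-envelope sizes for one Cryo-SXT dataset, using the slide's figures.
# ASSUMPTION: both end angles (-65 and +65 degrees) are acquired.
# ASSUMPTION: 4 bytes per voxel (32-bit float); the slides give no data type.

TILT_MIN_DEG = -65.0
TILT_MAX_DEG = 65.0
TILT_STEP_DEG = 0.5
VOLUME_SHAPE = (1000, 1000, 600)  # largest reconstructed volume quoted
BYTES_PER_VOXEL = 4               # assumed, see above


def projections_in_tilt_series(lo_deg, hi_deg, step_deg):
    """Number of projections when both end angles are included."""
    return int(round((hi_deg - lo_deg) / step_deg)) + 1


def voxel_count(shape):
    """Total number of voxels in a volume of the given shape."""
    total = 1
    for extent in shape:
        total *= extent
    return total


n_projections = projections_in_tilt_series(TILT_MIN_DEG, TILT_MAX_DEG, TILT_STEP_DEG)
n_voxels = voxel_count(VOLUME_SHAPE)
size_gb = n_voxels * BYTES_PER_VOXEL / 1e9

print(f"projections per tilt series: {n_projections}")
print(f"voxels per volume:           {n_voxels:,}")
print(f"volume size (assumed dtype): {size_gb:.1f} GB")
```

Under these assumptions a single volume holds several hundred million voxels, which helps explain why manual annotation is described as tedious.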
Tomographic Cell Analysis: Feature Extraction Workflow

Features are extracted from voxels to represent their appearance:
● Intensity-based filters (Gaussian Convolutions)
● Textural filters (eigenvalues of Hessian and Structure Tensor)
[Figure: workflow diagram. Labels: Data; Preprocessing; Data Representation; Feature Extraction; User Annotation + Machine Learning; User's Manual Segmentations; User Annotations; Classification; Predictions; Refinement]

Using few user annotations as an input:
● Machine learning classifier (Random Forest) trained to discriminate between Nucleus and Cytoplasm and predict the class of each SuperVoxel
● Markov Random Field then used to refine the predictions

The OCTOPUS Imaging Village
• Part of UKRI STFC Central Laser Facility at Harwell Campus
• Cluster of microscopes and lasers and expert end-to-end multidisciplinary support
• National imaging facility with peer-reviewed, funded access
➢ Look at example of Single Molecule Super-resolution Microscopy
➢ Use technique of Fluorescence Localisation Imaging with Photobleaching (FLImP)
➢ 400 nm → 10 nm
[Figure labels: STFC-funded; Single Particle Imaging, >5 nm resolution; Cellular Imaging, >200 nm resolution]

Automated acquisition with Machine Learning
[Figure: acquisition workflow. Steps and annotations as far as they can be recovered from the source:]
• Scan sample: brightfield & TIRF.
• Cell segmentation: locate boundaries of relevant cells.
• Find good FLImP ROIs: suitable fluorescent spot properties.
• Acquire FLImP data: time series of TIRF images. Automatically determine acquisition parameters. Determine photobleaching needs.
• Output
• Methods labelled in the figure: deep learning; deep learning & explicit analysis.
Key task - identify ROIs to acquire FLImP data
• Segment cell type of interest in NSCLC biopsy
  • Fibroblasts, epithelial, endothelial, red blood cells
• Segment ROI good for single molecule FLImP
  • Well focussed, separated spots with smooth background
• Deep learning (CNNs) proven in cell segmentation
• We wish to apply this to our system to identify the ROIs
ROI = Region Of Interest
[Figure: Single molecules in an epithelial cell from EBUS]

FLImP Vision for Personalized Cancer Care
[Figure: end-to-end workflow. Patient biopsy → microscope slides → biopsy samples processed in automated imager (cell segmentation, acquire FLImP data) → machine learning classifiers → patient-specific predictions & decisions. A patient database and a research database feed the classifiers, and patient outcomes are fed back.]
• Fully automate from acquisition to output
• Increase efficiency and impact
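
One ROI criterion on the key-task slide, "separated spots", can be made concrete with a small sketch. The Python below is only an illustration and not the method used in the talk, which proposes deep learning (CNNs) for ROI selection. It assumes spot centres have already been detected as (x, y) coordinates, and the function name and `min_separation` threshold are hypothetical.

```python
# Illustrative check for the "separated spots" criterion on a candidate ROI.
# ASSUMPTION: spot centres are already available as (x, y) pixel coordinates.
# ASSUMPTION: min_separation is a hypothetical threshold, not from the slides.
import math
from itertools import combinations


def spots_are_separated(spots, min_separation):
    """Return True if the ROI has at least one spot and every pair of
    spots is at least min_separation apart (Euclidean distance)."""
    if not spots:
        return False  # an ROI with no spots is of no use for acquisition
    return all(
        math.dist(p, q) >= min_separation
        for p, q in combinations(spots, 2)
    )


# Hypothetical example ROIs, coordinates in pixels.
well_separated = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
crowded = [(0.0, 0.0), (3.0, 4.0), (20.0, 20.0)]  # first two are 5 px apart

print(spots_are_separated(well_separated, min_separation=6.0))
print(spots_are_separated(crowded, min_separation=6.0))
```

A hand-written rule like this covers only spot spacing. The focus and background-smoothness criteria would need separate measures, which is part of the motivation for a learned approach.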