What Can SDSC Do For You?
Michael L. Norman, Director; Distinguished Professor of Physics
SAN DIEGO SUPERCOMPUTER CENTER
Mission: Transforming Science and Society Through “Cyberinfrastructure”
“The comprehensive infrastructure needed to capitalize on dramatic advances in information technology has been termed cyberinfrastructure.” – D. Atkins, NSF Office of Cyberinfrastructure

What Does SDSC Do?
Gordon – World’s First Flash-based Supercomputer for Data-intensive Apps
>300,000 times as fast as SDSC’s first supercomputer
1,000,000 times as much memory
Industrial Computing Platform: Triton
• Fast • Flexible • Economical • Responsive service
• Being upgraded now with faster CPUs and GPU nodes
First All-10Gig Multi-PB Storage System
High-Performance Cloud Storage, Analogous to AWS S3
• Data preservation and sharing • Low cost • High reliability • Web-accessible
Awesome connectivity to the outside world
[Network diagram: dual 384-port 10G switches linking SDSC to XSEDE, ESnet, UCSD RCI, CENIC, and the commercial Internet at 10G–100G, with room for “your 10G link here” (www.yourdatacollection.edu)]
What Does SDSC Do?
Over 100 in-house researchers and technical staff
Core competencies • Modeling & simulation • Parallel computing • Cloud computing • Energy efficient computing • Advanced networking • Software development • Database systems • Data mining/BI tools • Data modeling & integration • Data management • Data processing workflows • Datacenter management
Application Domains
• Fluid dynamics • Structural engineering • Biomolecular simulation • Computational chemistry • Seismic modeling • Coastal hydrology • Geoinformatics • Neuroinformatics • Bioinformatics/genomics • Radiology • Smart energy grids • Medicare fraud detection
SDSC is at the nexus of the genomic medicine revolution
Wayne Pfeiffer
bioKepler: Programmable and Scalable Workflows for Distributed Analysis of Large-Scale Biological Data
Ilkay Altintas
[Figure: MapReduce BLAST]
Assemble complex processing easily
Transparent access to diverse resources
Incorporate multiple software tools
Assure reproducibility
Community development model

Natasha Balac
Big Data Predictive Analytics for UCSD Smart Grid
Over 70,000 sensor streams from the UCSD Smart Grid processed on Gordon
What Does SDSC Do?
Center for Large-Scale Data Systems Research (CLDS)
Chaitan Baru
Jim Short
What Does SDSC Do?
HPWREN: A Unique Regional Capability for Public-Private Partnerships
SDSC Teaming with CALFIRE and SDG&E to Respond to and Prevent Wildfires
What Does SDSC Do?
What Can SDSC Do for You? • Just about anything involving high-capability/high-capacity technical computing, data management, and networking
• Our technical experts are eager to engage on R&D projects and service agreements customized to meet your needs
• There is a spectrum of ways we can interact
How do you begin working with us? • You have already taken the first step by coming here today
• Join the IPP program to learn more about SDSC expertise and resources • POC: Ron Hawkins ([email protected])
• Enjoy the rest of the program
SDSC Data Initiatives
Chaitan Baru
Associate Director, Data Initiatives
Director, Center for Large-scale Data Systems Research (CLDS)
SDSC, UC San Diego [email protected]
Outline
• SDSC and Data • Center for Large-Scale Data Systems Research (CLDS) • Graduate Student Engagement • Data Science Education and Training
SDSC IPP Research Review, June 12, 2013
SDSC’s Data DNA
• 25+ year history as a supercomputer center focused on data • Applied informatics is what we do ▫ At the intersection of science, data, and computational science ▫ Applied and applications-driven research and development • Multidisciplinary projects and interdisciplinary collaborations are how we do it ▫ They are our strength and the secret sauce in our above-average success rate on highly competitive proposals • Advancing the state of the art in science and improving the science research process is why we do it ▫ Lessons can be applied to business applications as well ▫ We believe many science applications are precursors to future business apps
Data: A rapidly evolving set of problems
• Analytics: Real-time and historical trend analysis → data velocity and volume • Integration: More comprehensive, holistic analysis → data variety • Costs ▫ Hardware, energy, software, people • Skill sets ▫ Need for “cross-trained”, data-savvy individuals ▫ Ability to thrive in multidisciplinary, holistic, data-driven environments ▫ Break out of narrow academic silos / corporate roles and departments ▫ A real shortage • Competition ▫ Global talent ▫ Increasingly, local problems • Privacy
SDSC R&D Activities in Data
• Informatics collaborations in
▫ High-energy physics, astrophysics/astronomy, computational chemistry, bioinformatics, biomedical informatics, geoinformatics, ecoinformatics, social science, neurosciences, smart energy grids, anthropology, archaeology, …
• Expertise and labs in: benchmarking, bioinformatics, computational science, data warehousing, data and information visualization, large graph and text data, machine learning, performance modeling, predictive analytics, scientific data management, spatial data management, workflow systems
• Centers of Excellence
▫ CLDS: Center for Large-scale Data Systems Research, Chaitan Baru, Director
▫ PACE: Predictive Analytics Center of Excellence, Natasha Balac, Director
▫ CAIDA: Center for Applied Internet Data Analysis, KC Claffy, Director
CLDS: Center for Large-Scale Data Systems Research
• Focus: Technical and technology management aspects related to big data • Key initiatives ▫ Big Data Benchmarking ▫ Data Value and How Much Information? • Principals: Chaitan Baru, James Short
Big Data Benchmarking – 1 • Community activity for development of a system-level big data benchmark, like TPC ▫ Coordinated by SDSC, http://clds.sdsc.edu/bdbc ▫ [email protected]: Biweekly phone meetings • A proposed BigData Top100 List, bigdatatop100.org • Two proposals under discussion ▫ BigBench: Extending TPC-DS for big data ▫ Data Analytics Pipeline: End-to-end analysis of event stream data • Discussions with TPC and SPEC
Big Data Benchmarking – 2
• Workshops on Big Data Benchmarking (WBDB)
▫ 1st WBDB: May 2012, San Jose
▫ 2nd WBDB: December 2012, Pune, India
▫ 3rd WBDB: July 2013, Xi’an, China
▫ 4th WBDB: October 2013, San Jose
Big Data Reference Datasets • An initiative of the Cloud Security Alliance Big Data Working Group ▫ Sreeranga Rajan, Fujitsu (Chair); Neel Sundaresan, eBay (Co-Chair); Wilco van Ginkel, Verizon (Co-Chair); Arnab Roy, Fujitsu (Crypto co-lead in BDWG/CSA) • Objective: ▫ Make reference datasets available on one or more platforms, for algorithm-level benchmarking • Hosted by SDSC ▫ http://clds.sdsc.edu/bdbc/referencedata
NIST Big Data Working Group • http://bigdatawg.nist.gov • Co-chairs: Chaitan Baru; Robert Marcus, CTO, ET-Strategies; Wo Chang, NIST; Chris Greer, NIST • Objective: 1-year time frame ▫ Definitions ▫ Taxonomies ▫ Reference Architectures ▫ Technology Roadmap • First meeting: June 19th • Open to community
Current CLDS Programs • Big Data Benchmarking (Pivotal lead) • Project on Data Value (NetApp lead) ▫ Develop definitions, frameworks, assessment methodology, and tools for Data Value ▫ Proposed Workshop on Data Value, Jan-Feb 2014 • How Much Information 2013 (Seagate lead) ▫ Consumer Information; Enterprise Information • Data Science Institute (Brocade) ▫ SDSC-level program
CLDS Sponsorship
• Current sponsors ▫ Seagate, Pivotal, NetApp, Brocade, Intel (soon) • Goals ▫ Small, focused group of core sponsors representing major industry quadrants (non-competitive) (6-8 companies) ▫ Extended network of members who provide scale and scope and help fund industry events (20-30 companies) • Sponsor structure: ▫ Founding ($100K, multi-year) ▫ Program ($50K, annual) • Member structure: ▫ Continuing ($10K+, pay as you go) ▫ Member ($5K, per workshop event)
Big Data Benchmarking: How you can participate • BDBC ▫ Join the BDBC mailing list: ~150 members, ~75 organizations ▫ Attend biweekly meetings, every other Thursday ▫ Present at biweekly meetings • WBDB ▫ Submit papers to workshops; attend workshops • Reference Datasets ▫ Participate in the Reference Datasets activity; contribute reference datasets • NIST Big Data Working Group ▫ Join and contribute to the NIST Big Data Working Group • Join CLDS as a sponsor
Data Value; How Much Information? How you can participate
• Data Value ▫ Join workshop planning and organization ▫ Contribute use cases ▫ Join CLDS as a sponsor • HMI? ▫ Contribute use cases ▫ Join CLDS as a sponsor
SDSC Data Science Institute (DSI)
• Objective: Provide training and education in data science • Audience: Industry attendees/academic researchers dealing with data • Format ▫ Coverage of end-to-end issues in data science ▫ Emphasis on hands-on learning using short course formats, e.g. 1-day, 2-day, 1-week, and up to 1-month ▫ Inclusion of modules taught by industry ▫ At-home and On-The-Road programs ▫ Possible internships associated with DSI • First offering: SDSC Summer Institute, “Discovering Big Data”, Aug 5-12, 2013
DSI: How you can participate
• Naming opportunity ▫ The
SDSC Graduate Projects Program
• SDSC projects for CSE MS graduate students ▫ Students work on projects with vendor hardware/software ▫ Upon successful completion, students receive internships at companies ▫ Companies have the option to hire students permanently • A testbed/sandbox for big data/data science/computational science ▫ Currently have a 32-node Hadoop cluster with Hortonworks HDP ▫ Plan to also install Intel Hadoop; would like to extend to 96 nodes ▫ Discussing a project with Brocade to test the performance of one of their Ethernet switches • Two students just completed their MS projects (presentations today!) ▫ Joining Google and Zynga
Graduate Projects Program: How you can participate
• Create a project ▫ Announce a project and offer internship after successful completion • Contribute hardware/software for projects • Contribute data with application scenarios / use cases
Bioinformatics Meets Big Data
Wayne Pfeiffer SDSC/UCSD June 12, 2013
Questions for today
• What is causing the flood of data in bioinformatics? • How much data are we talking about? • What bioinformatics codes are installed at SDSC? • What are typical compute- and data-intensive analyses in bioinformatics? • What are their computational challenges?
Cost of DNA sequencing has dropped much faster than the cost of computing in recent years, producing the flood of data
Size matters: how much data are we talking about?
• 3.1 GB for human genome • Fits on flash drive; assumes FASTA format (1 B per base) • >100 GB/day from a single Illumina HiSeq 2000 • 50 Gbases/day of reads in FASTQ format (2.5 B per base) • 300 GB to 1 TB of reads needed as input for analysis of whole human genome, depending upon coverage • 300 GB for 40x coverage • 1 TB for 130x coverage • Multiple TB needed for subsequent analysis • 45 TB on disk at SDSC for W115 project! (~10,000x single genome) • Multiple genomes per person! • May only be looking for kB or MB in the end
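The estimates above are straightforward arithmetic; a minimal sketch that reproduces the slide's numbers, using its own rough constants of 1 byte/base for FASTA and 2.5 bytes/base for FASTQ:

```python
# Back-of-the-envelope genomics storage estimates, using the slide's
# rough constants: FASTA ~1 byte/base, FASTQ ~2.5 bytes/base
# (sequence plus quality scores and headers).
GENOME_BASES = 3.1e9  # human genome, ~3.1 Gbases

def fastq_input_gb(coverage, bytes_per_base=2.5):
    """Approximate FASTQ input size (GB) for a given sequencing coverage."""
    return GENOME_BASES * bytes_per_base * coverage / 1e9

print(f"genome in FASTA: {GENOME_BASES / 1e9:.1f} GB")          # ~3.1 GB
print(f"40x coverage:    {fastq_input_gb(40):.0f} GB")          # ~300 GB
print(f"130x coverage:   {fastq_input_gb(130) / 1000:.1f} TB")  # ~1 TB
```

Note how quickly intermediate data dwarf the final answer: terabytes of reads may be processed to find variants that fit in kilobytes.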
SDSC has a rich set of bioinformatics software; representative codes are listed here
• Pairwise sequence alignment: ATAC, BFAST, BLAST, BLAT, Bowtie, BWA
• Multiple sequence alignment (via CIPRES gateway): ClustalW, MAFFT
• RNA-Seq analysis: GSNAP, TopHat
• De novo assembly: ABySS, Edena, SOAPdenovo, Velvet
• Phylogenetic tree inference (via CIPRES gateway): BEAST with BEAGLE, GARLI, MrBayes, RAxML, RAxML-Light
• Tool kits: BEDTools, GATK, SAMtools
Many bioinformatics projects use SDSC supercomputers
• HuTS: Human Tumor Study (STSI/SDSC) • Find mutations in tumor, and select appropriate chemotherapy • W115: Study of somatic mutations in genome of 115- year-old woman (VU Amsterdam, et al.) • Find somatic mutations in white blood cells • MRSA (STSI/UCSD) • Characterize genomes of MRSA found in local hospitals • Larry Smarr’s microbiome (UCSD) • Analyze Larry Smarr’s microbiome • Various phylogenetics studies (via CIPRES gateway) • Calculate phylogenetic trees
Computational workflow for read mapping & variant calling
• Goal: identify simple variants, e.g., single nucleotide polymorphisms (SNPs), also called single nucleotide variants (SNVs), and short insertions & deletions (indels)
• Pipeline: DNA reads in FASTQ format + reference genome in FASTA format → read mapping, i.e., pairwise alignment (BFAST, BWA, …) → alignment info in BAM format → variant calling (GATK, …) → variants: SNPs, indels, others
[Diagram: reads aligned to the reference, with a mismatched base illustrating a SNP and a gapped alignment illustrating a deletion]
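To make the variant-calling step concrete: a toy sketch that compares an already-aligned read against a reference, using made-up sequences (real pipelines map with BWA and call variants with GATK on BAM files; this only shows the core comparison):

```python
# Toy variant calling on an already-aligned read (hypothetical sequences).
reference = "CACCGGCGCAGACATTCTCATAAT"
read      = "CACCGGCGCAGTCATTCTCATAAT"  # aligned at offset 0

def call_snps(ref, read, offset=0):
    """Report (position, ref_base, read_base) mismatches as candidate SNPs."""
    return [(offset + i, r, q)
            for i, (r, q) in enumerate(zip(ref[offset:], read))
            if r != q]

print(call_snps(reference, read))  # → [(11, 'A', 'T')]
```

In practice a single mismatch may be a sequencing error rather than a true variant, which is why coverage depth, base qualities, and statistical callers matter.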
Pileup diagram shows mapping of reads to reference; example from HuTS shows a SNP in the KRAS gene; this means that cetuximab is not effective for chemotherapy
BWA analysis by Sam Levy, STSI; diagram from Andrew Carson, STSI
The CIPRES gateway lets biologists run phylogenetics codes at SDSC via a browser interface; http://www.phylo.org/index.php/portal
Computational challenges abound in bioinformatics
• Large amounts of data, which can grow substantially during analysis • Complex workflows, often with different computational requirements along the way • Parallelism that varies between steps in the workflow pipeline • Large shared memory needed for some analyses
Distributed Workflow-Driven Analysis of Biological Big Data
Ilkay Altintas, Ph.D.
Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD
Lab Director, Scientific Workflow Automation Technologies
[email protected]
So, what is a scientific workflow?
Scientific workflows emerged as an answer to the need to combine multiple Cyberinfrastructure components in automated process networks.
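As a minimal illustration of the idea (not Kepler itself, which models this with actors and directors): a process network sketched with plain Python generators, where each stage consumes a stream and yields results to the next:

```python
# A hand-rolled "process network": each stage is a generator that
# consumes upstream items and yields downstream. The stages and data
# here are hypothetical, chosen only to show the composition pattern.
def read_sequences(lines):
    """Data-acquisition stage: normalize raw input."""
    for line in lines:
        yield line.strip().upper()

def gc_content(seqs):
    """Analysis stage: compute GC fraction per sequence."""
    for s in seqs:
        yield s, (s.count("G") + s.count("C")) / len(s)

raw = ["acgt", "ggcc", "attt"]
for seq, gc in gc_content(read_sequences(raw)):
    print(f"{seq}: GC={gc:.2f}")
```

A workflow system adds what this sketch lacks: graphical composition, scheduling across distributed resources, fault tolerance, and provenance tracking.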
The Big Picture is Supporting the Scientist
From “Napkin Drawings” to Executable Workflows
[Diagram: a conceptual SWF sketch (Fasta File → Circonspect → Average Genome Size → PHACCS → Combine Results) refined into an executable SWF]
Scientific Workflow Automation Technologies @ SDSC
• Housed in San Diego Supercomputer Center at UCSD since 2004
• Mission: Support CI projects, scientists, and engineers in computational practices involving process management
• Research and development focus • Scientific workflow management • Data and process provenance • Distributed execution using scientific workflows • Engineering and streaming workflows for environmental observatories • Fault tolerance in scientific workflows • Sensor network management and monitoring • Role of scientific workflows in eScience infrastructures • Understanding collaborative work in workflow-driven eScience
• Scientific collaborations • Bioinformatics, Environmental Observatories, Oceanography, Computational Chemistry, Fusion, Geoinformatics, …
Workflows are Used as Toolboxes in Biological Sciences
– Assemble complex processing easily
– Transparent access to diverse resources
– Incorporate multiple software tools
– Assure reproducibility
– Community development model
[Diagram: data lifecycle: acquisition, generation, analysis, publication, archival]

Workflows foster collaborations!
• Flexibility and synergy
• Optimization of resources
• Increasing reuse
• Standards compliance
Need expertise to identify which tool to use, when, and how! Require computation models to schedule and optimize execution!

Kepler is a Scientific Workflow System
www.kepler-project.org
• An open collaboration, initiated August 2003
• Builds upon the open-source Ptolemy II framework
• Kepler 2.4 released 04/2013
Ptolemy II: A laboratory for investigating design
KEPLER: A problem-solving environment for Scientific Workflow
KEPLER = “Ptolemy II + X” for Scientific Workflows
CAMERA Example:
Using Scientific Workflows and Related Provenance for Collaborative Metagenomics Research
Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA) http://camera.calit2.net
CAMERA is a Collaborative Environment
• Multiple available CAMERA Data collections
• Data Cart: mixed collections of data (e.g., projects, samples)
• User Workspace: single workspace with access to all data and results (private and shared)
• Data Discovery: GIS and advanced query options
• Group Workspace: share specified data with collaborators
• User Data Analysis Workspace: workflow-based analysis

Workflows are a Central Part of CAMERA
More than 1500 workflow submissions monthly!
Inputs: from local or CAMERA file systems; user-supplied parameters
Outputs: sharable with a group of users, and links to the semantic database
All can be reached through the CAMERA portal at http://portal.camera.calit2.net

Pushing the boundaries of existing infrastructure and workflow system capabilities!
• Increase reuse
• Increase programmability by end users
• Increase resource utilization
• Make analysis a part of the end-to-end scientific model from data generation to publication
Add to these large amounts of next generation sequencing data!
bioKepler is a coordinated ecosystem of biological and technological packages!
www.kepler-project.org www.bioKepler.org
[Diagram: bioKepler layered on the Kepler and Provenance Framework, atop cyberinfrastructure platforms (BioLinux, Galaxy, Hadoop, Stratosphere) spanning BIO and COMPUTE]
• Development of a comprehensive bioinformatics scientific workflow module for distributed analysis of large-scale biological data
• Improvement on usability and programmability by end users!

What can we do for you?
Access to technology and biology packages:
• Bio-Linux
• Galaxy via Amazon
• Amazon Cloud
• Hadoop
• Stratosphere
(example: MapReduce BLAST; www.bioKepler.org)

Partners:
• Individual researchers and labs
• Research projects, e.g., CAMERA
• Academic institutions
• Private labs, e.g., JCVI

Services for you:
• Training
• Consulting
• Development
• Production workflows

Workflow-driven data- and compute-intensive processes: designing, scaling, and tracking pipelines and workflows

JCVI selection criteria:
• Programmability
• Modularity
• Customizability
• Scalability
Thanks! & Questions…
Ilkay Altintas [email protected]
How to download Kepler? https://kepler-project.org/users/downloads Please start with the short Getting Started Guide: https://kepler-project.org/users/documentation
Building a Semantic Information Infrastructure
Research Review: SDSC Industrial Partnership Program
Amarnath Gupta
Advanced Query Processing Lab San Diego Supercomputer Center, University of California at San Diego
Lots of Data, Little Glue
A Science Enterprise:
• Lab resources
• Experiment design
• Reagent catalogs
• Experimental data
• Chains of derived data
• Analysis results (incl. outputs of software tools)
• External information
• Publications and presentations
• …

An Industrial Enterprise:
• Customer data
• Product specifications
• Product sales data
• Production process data
• Customer call records
• Internal memos
• Legal documents
• Emails
• Social intranet
• …

Wide variation in data types, models, volume, systems, usage, updates, … What binds them together?

Some Consequences of Not Having a Glue
• An Orphan Disease Researcher • Spends five months to determine that her disease of interest relates to the mouse version of a gene studied in breast cancer • A Health Institution • Takes over a year to integrate clinical trial management data, drug characteristics, and patient reports to get a 360-degree view of drug effects • A Reagent Vendor • Spends significant time and money to determine the effectiveness of a product line • A Legal Department • Takes much longer to discover relevant documents and data for a complex, multi-party litigation • A Utilities Company • Cannot easily perform synchrophasor analytics for grid-behavior understanding because SCADA, EMS, and PMU data cannot be integrated
Integration through Semantics
We have been building NIF, a Semantic Information System for Neuroscience
• In NIF
▫ Data can be relational, XML, RDF, OWL, wiki content, manuscripts, publications, blogs, multimedia, annotations, …
▫ Any new domain goes through semantic processing
▫ Every piece of data gets some semantic markup
▫ Data are integrated and searched using ontological indices
▫ Any search or discovery process is interpreted through an ontology
▫ Any workflow that performs a mining-style computation utilizes semantic properties
• Some unanticipated questions NIF is designed to answer
▫ Are NIH studies gender-biased? Which?
▫ Is the CRE mouse line being used by anyone?
▫ Which diseases have data but have been underfunded?
▫ Is the method section in this paper under-specified?

Technology Research
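A miniature sketch of search through ontological indices, with hypothetical terms and document IDs (NIF's actual machinery is far richer; this only shows why ontology-backed search finds what keyword search misses):

```python
# Toy "ontological index": documents are indexed under canonical ontology
# terms, and a query for any synonym resolves through the ontology before
# hitting the index. All terms and doc IDs below are made up.
ontology = {  # synonym → canonical term
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}
index = {"myocardial infarction": ["doc12", "doc47"]}

def semantic_search(query):
    """Resolve the query through the ontology, then look up the index."""
    canonical = ontology.get(query.lower(), query.lower())
    return index.get(canonical, [])

print(semantic_search("Heart attack"))  # → ['doc12', 'doc47']
print(semantic_search("MI"))            # → ['doc12', 'doc47']
```

A plain keyword index would return nothing for "heart attack" if the documents only say "myocardial infarction"; the ontology supplies that bridge.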
• Semantic Prospecting for any industry and application domain • Semi-automatic Construction of Semantic Models for any problem domain • Developing general-purpose, yet domain-specific Semantic Search Engines for heterogeneous data • Complex Information Discovery using semantic graph analysis and mining-style computation • Multi-domain Information Integration using semantic information bridges
Why SDSC? • Semantic processing can be complex and costly • Domain model construction (incl. active learning) • Semantic indexing and correlation indexing • Graph query processing • Semantic search algorithms • Landscape and discovery analytics • The SDSC infrastructure handles complexity at scale • Configurable compute nodes • Large-memory systems • SSD drives
Let SDSC help with your infrastructure, research, and service needs
Biomedical data integration system and web search engine
Julia Ponomarenko, PhD Michael Baitaluk, PhD
San Diego Supercomputer Center
[Diagram: the landscape of biomedical data: sequences, networks, variations data, publications, annotations, taxonomies, epigenetic data, structures, expression data, biochemical data]
Databases and web pages (2,000+)
How can a researcher embrace such an amount of data in its entirety?
Data Integration Resources
This leaves a researcher to work with partial, incomplete, and non-comprehensive data sets!
[Diagram repeated: the same data types (0.1 PB) across 2,000+ databases and web pages]
Biological Ontologies + Semantic Web technologies
Data Warehouse
[Series of screenshots: database web pages and other web pages]
For each ontological term A and page X, calculate the relevance score of X to A. Automatically extract the data and map them into the internal database schema.

User Community
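The slide does not give the scoring formula; as a stand-in, a simple term-frequency relevance score illustrates the shape of the computation (the term and page text below are hypothetical):

```python
# Hypothetical relevance score between an ontology term A and a page X.
# The slide doesn't specify the actual formula; plain term frequency
# stands in here to show the idea.
import re

def relevance(term, page_text):
    """Fraction of page tokens matching the term."""
    tokens = re.findall(r"[a-z0-9]+", page_text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t == term.lower())
    return hits / len(tokens)

page = "BRCA1 is a human tumor suppressor gene. Mutations in BRCA1 raise cancer risk."
print(f"{relevance('BRCA1', page):.3f}")  # → 0.154 (2 hits / 13 tokens)
```

A production scorer would weight rarer terms more heavily (e.g., TF-IDF) and expand the term through the ontology's synonyms, as in the NIF discussion above.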
[Architecture diagram: public data on the web and the user’s private data feed IntegromeDB, which serves the user community through the integromeDB.org web portal & API and the BiologicalNetworks.org Java application]

integromedb.org
Integromedb.org visit statistics
Your User Community
[Diagram: public data on the web and the user’s private data feed Your Database, which serves your user community through a web portal, APIs, and applications]

Your User Community
Your Google-like “Search-in-the-Box” Appliance
PACE: Predictive Analytics and Data Mining Research and Applications
Natasha Balac, Ph.D.
Director, PACE (Predictive Analytics Center of Excellence) @ San Diego Supercomputer Center, UCSD
PACE: Closing the gap between Government, Industry and Academia
PACE is a non-profit, public educational organization
• To promote, educate, and innovate in the area of Predictive Analytics
• To leverage predictive analytics to improve the education and well-being of the global population and economy
• To develop and promote a new, multi-level curriculum to broaden participation in the field of predictive analytics
[Diagram: PACE activities: inform, educate and train; bridge the industry/academia gap; develop standards and methodology; provide predictive analytics services; high-performance scalable data mining; data mining repository of very large data sets; foster research and collaboration]
Foster Research and Collaboration
• Fraud detection
• Modeling user behaviors
• Smart Grid analytics
• Solar-powered system modeling
• Microgrid anomaly detection
• Distributed energy generation
• Manufacturing
• Sports analytics
• Genomics
UCSD Smart Grid
• UCSD Smart Grid sensor network data set • 45 MW peak microgrid; daily population of over 54,000 people • Self-generates 92% of its own annual electricity load • Smart Grid data: over 100,000 measurements/sec • Sensor and environmental/weather data • Large amounts of multivariate and heterogeneous data streaming from complex sensor networks • Predictive analytics throughout the microgrid
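The slides do not specify PACE's methods; as one illustration of streaming analytics on a single sensor stream, a running mean and standard deviation (Welford's algorithm) can flag outlying readings without storing the stream (the kW values below are made up):

```python
# Hypothetical streaming anomaly detector for one sensor stream:
# flags readings more than `threshold` standard deviations from the
# running mean, updating statistics incrementally (Welford's algorithm).
class RunningAnomalyDetector:
    def __init__(self, threshold=3.0):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean
        self.threshold = threshold

    def update(self, x):
        """Return True if x is anomalous relative to the history so far."""
        anomalous = False
        if self.n > 1:
            std = (self.m2 / (self.n - 1)) ** 0.5
            anomalous = std > 0 and abs(x - self.mean) > self.threshold * std
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

det = RunningAnomalyDetector()
readings = [45.0, 45.2, 44.9, 45.1, 45.0, 44.8, 45.1, 90.0]  # kW samples
flags = [det.update(x) for x in readings]
print(flags)  # only the final 90.0 reading is flagged
```

Constant memory per stream is what makes this style of analysis feasible at 100,000 measurements/sec across tens of thousands of streams.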
4 V’s of Big Data
IBM, 2012
What to do with big data? Eric Sall (IBM)’s list
• Big Data Exploration ▫ To get an overall understanding of what is there
• 360-degree view of the customer ▫ Combine both internally available and external information to gain a deeper understanding of the customer
• Monitoring cyber-security and fraud in real time
• Operational Analysis ▫ Leveraging machine-generated data to improve business effectiveness
• Data Warehouse Augmentation ▫ Enhancing warehouse solutions with new information models and architecture
Big Data – Big Training
• “Data Scientist” • The “Hot new gig in town” • O’Reilly report • Data Scientist: The Sexiest Job of the 21st Century • Harvard Business Review, October 2012 • The future belongs to the companies and people that turn data into products • Article in Fortune • “The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist”
Data Science Job Growth
By 2018, a projected shortage of 140,000-190,000 predictive analysts and 1.5M managers/analysts in the US
PACE Education
• Data mining Boot Camps
▫ Boot Camp 1: September 12-13, 2013
▫ Boot Camp 2: October 17-18, 2013
▫ On-site Personalized Boot Camp (groups of 10-15 or 20-30)
• Tech Talks – every 3rd Wednesday
• Workshops, Webinars
• Interesting Reads, “Tool-off”
• “Bring your own data”
Predictive Analytics Consulting
• Full-service consulting and development services
• Targeted projects with industry and agency partners
• Applied and applications-oriented research
• Technical expertise and industry experience
(Chart: Mike Gualtieri’s blog)

Questions?
• http://pace.sdsc.edu/
• For further information, contact Natasha Balac [email protected]
Performance Modeling and Characterization (PMaC) Lab

Laura Carrington, Ph.D., PMaC Lab Director
San Diego Supercomputer Center, University of California, San Diego
The PMaC Lab
Mission Statement: Research the complex interactions between HPC systems and applications to predict and understand the factors that affect performance and power on current and projected HPC platforms.
• Develop tools and techniques to deconstruct HPC systems and HPC applications to provide detailed characterizations of their power and performance.
The PMaC Lab utilizes the characterizations to construct power and performance models that guide:
• Improvement of application performance (2 Gordon Bell finalists, DoD HPCMP applications, NSF Blue Waters, etc.)
• System procurement, acceptance, and installation (DoD HPCMP procurement team, DoE upgrade of ORNL Jaguar, installation of NAVO PWR6)
• Accelerator assessment for a workload (performance assessment and prediction for GPUs & FPGAs)
• Hardware customization / hardware-software co-design (performance and power analysis for exascale)
• Improvement of energy efficiency and resiliency (Green Queue project, DoE SUPER Institute, DoE BSM, etc.)
Automated tools & techniques to characterize HPC systems & applications
HPC system: characterize how computational (& communication) patterns affect the overall power draw.
HPC application: characterize the computational (& communication) behavior of the application, decomposed into basic blocks (loops and functions).
Design software- and hardware-aware energy and performance optimization techniques
Hardware Customization
• 10x10 project (PI: A. Chien @ U. of Chicago): heterogeneous processor architecture
• Reconfigurable memory hierarchy (L1, L2, and L3)
• Selection of energy-optimal configuration via simulation and reuse distance analysis

Analysis of 37 workloads (~70% energy savings):

Search space   # Configurations in search space   # Unique configurations selected   Avg. energy savings (%)
Full           2652                               33                                 68.8
2N             469                                26                                 66.0
Restricted     224                                23                                 65.5
Cluster        10                                 10                                 63.7
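Reuse distance, one of the characterization inputs named above, is the number of distinct addresses touched between two accesses to the same address. A minimal stdlib-only sketch (illustrative only, not PMaC's actual tooling):

```python
def reuse_distances(trace):
    """For each access, report how many distinct addresses were
    touched since the previous access to the same address
    (None for first-time accesses)."""
    stack = []  # most recently used address at the end
    out = []
    for addr in trace:
        if addr in stack:
            # Depth from the top of the stack = reuse distance.
            depth = len(stack) - 1 - stack.index(addr)
            out.append(depth)
            stack.remove(addr)
        else:
            out.append(None)
        stack.append(addr)
    return out

# An access pattern whose reuses all have distance <= 1
# would fit in a 2-entry fully associative cache.
print(reuse_distances(['a', 'b', 'a', 'c', 'a']))  # [None, None, 1, None, 1]
```

A histogram of these distances, compared against cache sizes, predicts hit rates for a candidate memory hierarchy configuration.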
PMaC's Green Queue Framework (optimizing for performance & power)

Goal: use machine and application characterization to make application-aware energy optimizations during execution.

[Figure: power (kW) vs. time (seconds) for applications running on 1,024 Gordon cores, with per-application energy savings ranging from 4.8% to 32%]
Questions

Benchmarking and Tuning Big Data Software
David Nadeau, Ph.D.
What to tune?
• Application benchmarking/profiling finds what to tune for one system
• But... we want answers for many systems for now and (hopefully) years to come
• Requires understanding fundamental trends: processor speeds, core counts, memory bandwidth, memory latency, network bandwidth, etc.
A few trends

Clock speeds: almost flat. Cycles per math op: almost flat. Raw math ability per core is not improving.
A few trends

SPEC float per CPU: way up. Core count per CPU: way up. Float math per core (thread): up 8 to 20% per year. Why is this trend upward if math speed is flat?
A few trends

DDR bandwidth: up 15% per year. Memory bandwidth per core: up 0 to 12% per year. The improvement is primarily from better memory bandwidth, not math ability.
What this means
• SPEC, Dhrystones, etc. are now dominated by memory performance, not math.
  • 1 multiply = 1 cycle
  • 1 memory access = 300 cycles
  • Not likely to change soon
• Application performance is dominated by memory access costs.
• So tune access patterns or data order.
Example: 3D volume
Task: store a 3D volume in memory. Array of arrays of arrays, one big array, etc.

Worst (blue): array of arrays of arrays. Best (black): one big array with simple 3D indexing. Fewer memory references is much faster.
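The "one big array" layout replaces repeated pointer chasing with a single indexing computation. A minimal sketch (class and method names here are illustrative, not the benchmark code behind the chart):

```python
class Volume3D:
    """A 3D volume stored as one flat list with computed indexing."""

    def __init__(self, nx, ny, nz, fill=0.0):
        self.nx, self.ny, self.nz = nx, ny, nz
        self.data = [fill] * (nx * ny * nz)

    def index(self, i, j, k):
        # One multiply-add chain instead of dereferencing two
        # intermediate arrays, as the nested layout would require.
        return (i * self.ny + j) * self.nz + k

    def get(self, i, j, k):
        return self.data[self.index(i, j, k)]

    def set(self, i, j, k, value):
        self.data[self.index(i, j, k)] = value

v = Volume3D(4, 4, 4)
v.set(1, 2, 3, 9.0)
print(v.get(1, 2, 3))  # 9.0
```

The flat layout also keeps the whole volume contiguous, so sequential sweeps touch memory in cache-line order.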
Example: 3D volume sweep
Task: sweep a plane through the volume in 6 axis directions: +X, -X, +Y, -Y, +Z, -Z.

4 sweep directions are slow; 2 are 10x to 30x faster. Sweeping in natural data order is much faster.
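Why some sweep directions are fast falls out of the indexing arithmetic for a flat, Z-fastest layout: consecutive k values are adjacent in memory, while consecutive i values are ny*nz elements apart. A small sketch (dimensions are illustrative):

```python
nx = ny = nz = 64

def flat(i, j, k):
    """Flat index for a Z-fastest (k innermost) layout."""
    return (i * ny + j) * nz + k

# Innermost loop over k: consecutive accesses are adjacent.
print(flat(0, 0, 1) - flat(0, 0, 0))   # 1
# Innermost loop over j: stride of nz elements.
print(flat(0, 1, 0) - flat(0, 0, 0))   # 64
# Innermost loop over i: stride of ny*nz elements,
# so every access lands on a different cache line (and page).
print(flat(1, 0, 0) - flat(0, 0, 0))   # 4096
```

Sweeps whose innermost loop matches the fastest-varying index stream through memory; the other directions jump by large strides and defeat the cache.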
Example: 3D volume bricking
Task: sweep in any direction. Brick it: a cube of cubes.

Tested layouts: unbricked, 2x2x2, 4x4x4, 8x8x8, 16x16x16. Bricking makes all sweep directions have similar performance; bricked data order is more cache friendly.
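Bricking stores each small cube contiguously, so a sweep in any direction stays inside one brick most of the time. A minimal index computation, assuming a brick edge B that evenly divides each dimension (values are illustrative):

```python
B = 4                  # brick edge (assumed to divide nx, ny, nz)
nx = ny = nz = 16
bx, by, bz = nx // B, ny // B, nz // B   # bricks per axis

def bricked_index(i, j, k):
    """Flat index where each BxBxB brick is stored contiguously."""
    brick = ((i // B) * by + (j // B)) * bz + (k // B)   # which brick
    local = ((i % B) * B + (j % B)) * B + (k % B)        # offset inside it
    return brick * (B * B * B) + local

# Neighbors along any axis are at most B*B elements apart
# while they share a brick, regardless of sweep direction.
print(bricked_index(0, 0, 1) - bricked_index(0, 0, 0))  # 1
print(bricked_index(0, 1, 0) - bricked_index(0, 0, 0))  # 4
print(bricked_index(1, 0, 0) - bricked_index(0, 0, 0))  # 16
```

Compare with the unbricked layout, where the worst-direction stride is ny*nz: bricking trades a slightly worse best case for a vastly better worst case, which is why all sweep directions converge to similar performance.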
Example: Desktop compression
Task: compress & send the desktop in real time. Many codecs.

Clever codecs are slower, despite smaller result data. Codecs with fewer memory references are faster.
Example: Parallel compositing
Task: composite N images on a cluster of N nodes. Many algorithms.

Many small messages vs. few big messages, with the same total amount of data: better network use is faster.
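The slide does not name the algorithms, but a standard contrast is direct-send compositing (every node sends its image fragment to every other node) versus a binary-swap schedule (log2(N) rounds of one larger message per node). Counting messages makes the point:

```python
import math

def direct_send_messages(n):
    """Every node sends a fragment to every other node: O(n^2) messages."""
    return n * (n - 1)

def binary_swap_messages(n):
    """log2(n) rounds; each node sends one (larger) message per round.
    Assumes n is a power of two."""
    return n * int(math.log2(n))

for n in (16, 64, 256):
    print(n, direct_send_messages(n), binary_swap_messages(n))
```

At 256 nodes that is 65,280 small messages versus 2,048 larger ones carrying the same total data; with a fixed per-message latency, the fewer-bigger-messages schedule wins easily.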
And so on…

Gordon: A First-of-its-Kind Data-Intensive Supercomputer
SDSC Research Review June 5, 2013
Shawn Strande, Gordon Project Manager
Gordon is a highly flexible system for exploring a wide range of data-intensive technologies and applications:
• High performance flash technology
• Massively large memory environments
• High performance parallel file system
• Scientific databases
• On-demand Hadoop and data-intensive environments
• Complex application architectures
• New algorithms and optimizations
• High speed InfiniBand interconnect
Gordon is a data movement machine
• Sandy Bridge compute nodes (1,024): 64 TB memory, 341 Tflop/s
• Flash-based I/O nodes (64): 300 TB Intel eMLC flash, 35M IOPS
• Large memory nodes: vSMP Foundation 5.0, 2 TB of cache-coherent memory per node
• Dual-rail, 3D torus interconnect: 7 GB/s
• "Data Oasis" Lustre PFS: 100 GB/sec, 4 PB
SSD latencies are 2 orders of magnitude lower than HDDs' (that's a big deal for some data-intensive applications)

Typical hard drive: ~10 ms (0.010 s) latency, 200 IOPS
Solid state disk: ~100 µs (0.0001 s) latency, 35,000/3,000 IOPS (R/W)
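Those latency figures compound over many small random accesses. A quick back-of-the-envelope script using the numbers above:

```python
HDD_LATENCY_S = 0.010    # ~10 ms per random access
SSD_LATENCY_S = 0.0001   # ~100 us per random access

def seconds_for_random_reads(n, latency_s):
    """Total time when per-access latency dominates
    (ignores transfer time and queueing)."""
    return n * latency_s

n = 1_000_000  # one million small random reads
print(seconds_for_random_reads(n, HDD_LATENCY_S))  # 10000.0 s (~2.8 hours)
print(seconds_for_random_reads(n, SSD_LATENCY_S))  # 100.0 s (under 2 minutes)
```

The 100x latency gap turns a multi-hour random-access workload into a couple of minutes, which is exactly the regime the flash-based I/O node case studies below sit in.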
Protein Data Bank (flash-based I/O node)
The RCSB Protein Data Bank (PDB) is the leading primary database that provides access to the experimentally determined structures of proteins, nucleic acids and complex assemblies. In order to allow users to quickly identify more distant 3D relationships, the PDB provides a pre-calculated set of all possible pairwise 3D protein structure alignments.
Although the pairwise structure comparisons are computationally intensive, the bottleneck is the centralized server that is responsible for assigning work, collecting results and updating the MySQL database.
Using a dedicated Gordon I/O node and the associated 16 compute nodes, the work could be accomplished 4-6x faster than using the Open Science Grid (OSG).

Configuration     Time for 15M alignments   Speedup
Reference (OSG)   24 hours                  1
Lyndonville       6.3 hours                 3.8
Taylorsville      4.1 hours                 5.8
OpenTopography Facility (flash-based I/O node)

The NSF-funded OpenTopography Facility provides online access to Earth science-oriented high-resolution LIDAR topography data along with online processing tools and derivative products. Point cloud data are processed to produce digital elevation models (DEMs): 3D representations of the landscape.
The local binning algorithm uses elevation information from only the points inside a circular search area with a user-specified radius. An out-of-core version of the algorithm exploits secondary storage to save intermediate results when the size of a grid exceeds that of memory.
Using a dedicated Gordon I/O node with the fast SSD drives reduces run times of massive concurrent out-of-core processing jobs by a factor of 20x.

[Figures: high-resolution bare earth DEM of the San Andreas fault south of San Francisco, generated using OpenTopography LIDAR processing tools; illustration of local binning geometry, where dots are LIDAR shots and '+' marks the DEM node locations at which elevation is estimated. Source: C. Crosby, UNAVCO]
Dataset: Lake Tahoe, 208 million LIDAR returns, 0.2-m grid resolution and 0.2-m radius

# concurrent jobs   OT servers   Gordon ION   Speed-up
1                   3297 sec     1102 sec     3x
4                   29607 sec    1449 sec     20x
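An in-memory sketch of the local binning idea (illustrative only; OpenTopography's implementation is out-of-core and far more elaborate): each DEM grid node averages the elevations of the LIDAR points within its search radius.

```python
import math

def local_binning(points, grid_nodes, radius):
    """points: (x, y, z) LIDAR returns; grid_nodes: (x, y) DEM node locations.
    Returns the mean elevation of in-radius points per node (None if none)."""
    dems = []
    for gx, gy in grid_nodes:
        hits = [z for x, y, z in points
                if math.hypot(x - gx, y - gy) <= radius]
        dems.append(sum(hits) / len(hits) if hits else None)
    return dems

points = [(0.0, 0.0, 10.0), (0.1, 0.1, 12.0), (5.0, 5.0, 30.0)]
print(local_binning(points, [(0.0, 0.0), (5.0, 5.0)], radius=0.5))
# [11.0, 30.0]
```

The out-of-core variant partitions the grid into tiles, spilling each tile's partial sums and counts to disk; that spill traffic is exactly what the SSD-backed I/O node accelerates.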
IntegromeDB (flash-based I/O node)
IntegromeDB is a large-scale data integration system and biomedical search engine. It collects and organizes heterogeneous data from over a thousand databases covered by the Nucleic Acids Research database collection and millions of public biomedical, biochemical, drug and disease-related resources.
IntegromeDB is a distributed system stored in a PostgreSQL database containing over 5,000 tables, 500 billion rows and 50TB of data. New content is acquired using a modified version of the SmartCrawler web crawler and pages are indexed using Apache Lucene.
The project was awarded two Gordon I/O nodes, the accompanying compute nodes, and 50 TB of space on Data Oasis. The compute nodes are used primarily for post-processing of raw data. Using the I/O nodes dramatically increased the speed of read/write file operations (10x) and I/O database operations (50x).
Source: Michael Baitaluk (UCSD) Used by permission 2013
Structural response of bone to stress (vSMP)

The goal of the simulations is to analyze how small variances in boundary conditions affect high-strain regions in the model. The research goal is to understand the response of trabecular bone to mechanical stimuli. This has relevance for paleontologists, to infer habitual locomotion of ancient people and animals, and in treatment strategies for populations with fragile bones, such as the elderly.
• 5 million quadratic, 8-noded elements
• Model created with a custom Matlab application that converts 253 micro-CT images into voxel-based finite element models
Source: Matthew Goff, Chris Hernandez (Cornell University) Used by permission. 2012
Managing HPC Systems at the National and Campus Levels

Rick Wagner, Ph.D. Candidate
HPC Systems Manager, SDSC
Systems
• Gordon
• Trestles
• TSCC
• Data Oasis
• Dev

Challenges: enabling the unique & differentiating features of each system…
• Gordon: technology
• Trestles: policy
• TSCC: business model
• Data Oasis: agnosticism
• Dev: isolation
…without maintaining N different systems.
Solutions

Part I:
• Common systems management (Rocks)
• Shared staff responsibility across systems

Part II:
• Build on (cower behind) core SDSC services
Core SDSC services:
• User Services
• Operations
• Storage
• VM Hosting
• Security
• Networking
• SDSC Sandbox
SDSC's Myriad Areas of Expertise

Ilkay Altintas, Ph.D.
Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD
Lab Director, Scientific Workflow Automation Technologies
[email protected]
Visualization Services Group
Lead: Amit Chourasia
• Develops new ways to represent data visually
• Research collaborations in many science and engineering disciplines
• Visualization support and consulting
• Provides visualization education and training
• Website: http://www.sdsc.edu/us/visservices/
Ross Walker ([email protected]), Andreas Goetz ([email protected])

Technical Expertise
• GPU computing
• CUDA Teaching Center
• Parallel computing
• Workstation and cluster design for biomolecular simulations and computational drug discovery
• Cloud computing
Scientific Expertise
• Molecular dynamics
• Quantum chemistry
• Force field development, automatic parameter fitting
• Drug discovery
• Biomolecular simulations
SDSC Spatial Information Systems Lab
Contact: Ilya Zaslavsky ([email protected])
• Services-based spatial information integration infrastructure (Hydrologic Information System, the largest in the world)
• Advanced online and high performance GIS and geospatial databases
• Information interoperability in the geosciences
• Long-term spatial data preservation
• Information models and data standards (adopted by the federal government and internationally)
• Innovative user interfaces for connecting people, projects, resources…
• Large distributed data systems and catalogs for scientific field observations (NSF EarthCube, hydrology, critical zone, and others)
Example projects: CZO, brain data integration, Mexico Health Atlas, Ecosystem Services, Katrina Dashboard portal

High Performance Wireless Research and Education Network (HPWREN)
http://hpwren.ucsd.edu/
An extension of the Area Situational Awareness for Public Safety Network (ASAPnet). ~60 existing HPWREN/ASAPnet fire agency sites as of June 2013 (from a Google Earth KML object).
Project partners include: • the County of San Diego • the California Department of Forestry and Fire Protection (CAL FIRE) • the United States Forest Service (USFS) • San Diego Gas and Electric (SDG&E) • San Diego State University (SDSU)
Mathematical anthropology

The identification of cohesive subgroups in large networks is of key importance to problems spanning the social and biological sciences. A k-cohesive subgroup has the property that it cannot be disconnected without removing at least k of its nodes. By Menger's theorem, this is equivalent to a set of vertices in which every pair of members is joined by k vertex-independent paths.
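The disconnection definition can be checked directly on small graphs with a stdlib-only sketch (illustrative only, not the R/igraph code used in the project): a group is k-cohesive if no removal of fewer than k of its vertices disconnects it.

```python
from itertools import combinations

def is_connected(nodes, edges):
    """True if the subgraph induced by `nodes` is connected (BFS)."""
    nodes = set(nodes)
    if len(nodes) <= 1:
        return True
    adj = {n: set() for n in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    start = next(iter(nodes))
    seen, frontier = {start}, [start]
    while frontier:
        nxt = []
        for v in frontier:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    nxt.append(w)
        frontier = nxt
    return seen == nodes

def is_k_cohesive(nodes, edges, k):
    """Brute force: no removal of fewer than k vertices disconnects the group."""
    nodes = list(nodes)
    for r in range(k):  # try removing 0 .. k-1 vertices
        for removed in combinations(nodes, r):
            rest = [n for n in nodes if n not in removed]
            if not is_connected(rest, edges):
                return False
    return True

# A 4-cycle is 2-cohesive: removing any single vertex leaves it
# connected, but removing two opposite vertices disconnects it.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(is_k_cohesive(range(4), cycle, 2))  # True
print(is_k_cohesive(range(4), cycle, 3))  # False
```

This brute-force check is exponential in k; the production analyses instead compute vertex connectivity via flow-based algorithms, which is what makes the parallel implementation described below necessary at scale.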
Doug White (UCI) and his collaborators are using software developed in R with the igraph package to study social networks.
The software was parallelized using the R multicore package and ported to Gordon's vSMP nodes by SDSC computational scientist Robert Sinkovits. Analyses of large problems (a 2400-node Watts-Strogatz model) are achieving estimated speedups of 243x on 256 compute cores. Work is underway to identify cohesive subgroups in large co-authorship networks.
James Moody, Douglas R. White. Structural Cohesion and Embeddedness: A Hierarchical Conception of Social Groups. American Sociological Review 68(1):1-25. 2004
Impact of high-frequency trading

To determine the impact of high-frequency trading activity on financial markets, it is necessary to construct nanosecond-resolution limit order books: records of all unexecuted orders to buy/sell stock at a specified price. Analysis provides evidence of quote stuffing, a manipulative practice that involves submitting a large number of orders with immediate cancellation to generate congestion.
Run times for LOB construction of heavily traded NASDAQ securities (June 4, 2010)
Symbol   Wall time (s), orig. code   Wall time (s), opt. code   Speedup
SWN      8400                        128                        66x
AMZN     55200                       437                        126x
AAPL     129914                      1145                       113x
Optimizations by SDSC computational scientists Robert Sinkovits and DongJu Choi to the original thread-parallel code resulted in greater than 100x speedups. It is now possible to analyze an entire day of NASDAQ activity in a few hours using 16 Gordon nodes. With these new capabilities, the group is beginning to consider analysis of options data, which has 100x greater memory requirements.
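A limit order book is essentially price-indexed collections of resting orders updated by add/cancel events. A toy stdlib-only sketch (illustrative only, not the optimized code described above):

```python
import heapq

class LimitOrderBook:
    """Toy limit order book: track unexecuted buy/sell orders by price."""

    def __init__(self):
        self.orders = {}   # order_id -> (side, price, shares)
        self.bids = []     # max-heap of bid prices (stored negated)
        self.asks = []     # min-heap of ask prices

    def add(self, order_id, side, price, shares):
        self.orders[order_id] = (side, price, shares)
        if side == "buy":
            heapq.heappush(self.bids, -price)
        else:
            heapq.heappush(self.asks, price)

    def cancel(self, order_id):
        # Lazy deletion: drop the order record; stale heap entries
        # are skipped when the best price is next queried.
        self.orders.pop(order_id, None)

    def best_bid(self):
        live = {p for s, p, _ in self.orders.values() if s == "buy"}
        while self.bids and -self.bids[0] not in live:
            heapq.heappop(self.bids)
        return -self.bids[0] if self.bids else None

book = LimitOrderBook()
book.add(1, "buy", 100.10, 500)
book.add(2, "buy", 100.25, 200)
book.cancel(2)            # quote stuffing: add then cancel immediately
print(book.best_bid())    # 100.1
```

Even this toy version shows why quote stuffing is costly to reconstruct: every add/cancel pair still has to be processed, so message volume, not executed volume, drives the run time in the table above.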
Source: Mao Ye, Dept. of Finance, U. Illinois. Used by permission. 6/1/2012
SDSC's Education, Outreach and Training (EOT) Programs

Diane Baxter, Ph.D., Ange Mason, Jeff Sale
San Diego Supercomputer Center, University of California, San Diego

SDSC EOT Program Challenges
• Prepare teachers to teach their students the skills and knowledge for a future in which technology is power and computational skills bring success
• Give students access to the computational tools, knowledge, and thinking skills to seek their dreams and create their future
• Train researchers at all levels to use HPC and data-intensive computing tools to accelerate discovery in science, engineering, technology, mathematics, and other data-related fields
Thanks! & Questions…
Ilkay Altintas [email protected]
Industrial Engagement at SDSC
• Industrial Partners Program (IPP): a "gateway" program with annual membership; large company, small company, and individual categories
• CLDS: focus on Big Data
• PACE: focus on predictive analytics
• Research contracts: specific defined projects
• Service agreements: for use of SDSC resources/services
THANK YOU!