
What Can SDSC Do For You?
Michael L. Norman, Director; Distinguished Professor of Physics

SAN DIEGO SUPERCOMPUTER CENTER
Mission: Transforming Science and Society Through “Cyberinfrastructure”

“The comprehensive infrastructure needed to capitalize on dramatic advances in information technology has been termed cyberinfrastructure.” — D. Atkins, NSF Office of Cyberinfrastructure

What Does SDSC Do?

Gordon – World’s First Flash-based Supercomputer for Data-intensive Apps

>300,000 times as fast as SDSC’s first supercomputer

1,000,000 times as much memory

Industrial Computing Platform: Triton

• Fast
• Flexible
• Economical
• Responsive service

• Being upgraded now with faster CPUs and GPU nodes

First All-10GbE, Multi-Petabyte Storage System

High-Performance Cloud Storage, Analogous to AWS S3

• Data preservation and sharing
• Low cost
• High reliability
• Web-accessible

Awesome connectivity to the outside world

[Network diagram: a 384-port 10G switch links SDSC to XSEDE (10G), ESnet (100G), UCSD RCI (20G), CENIC (100G), and the commercial Internet, with a 10G link available for external data collections — “www.yourdatacollection.edu, your 10G link here.”]

What Does SDSC Do?

Over 100 in-house researchers and technical staff

Core competencies
• Modeling & simulation
• Parallel computing
• Cloud computing
• Energy-efficient computing
• Advanced networking
• Software development
• Systems
• Data mining/BI tools
• Data modeling & integration
• Data management
• Data processing workflows
• Datacenter management

Application Domains

• Fluid dynamics
• Structural engineering
• Biomolecular simulation
• Computational chemistry
• Seismic modeling
• Coastal hydrology
• Geoinformatics
• Bioinformatics/genomics
• Radiology
• Smart energy grids
• Medicare fraud detection

SDSC is at the nexus of the genomic medicine revolution

Wayne Pfeiffer

bioKepler: Programmable and Scalable Workflows for Distributed Analysis of Large-Scale Biological Data

[Figure: MapReduce BLAST workflow]
Ilkay Altintas

• Assemble complex processing easily
• Access diverse resources transparently
• Incorporate multiple software tools
• Assure reproducibility

• Community development model

Natasha Balac

Big Data Predictive Analytics for UCSD Smart Grid

Over 70,000 sensor streams from UCSD Smart Grid processed on Gordon

What Does SDSC Do?

Center for Large Scale Data Systems Research (CLDS)

Chaitan Baru

Jim Short

What Does SDSC Do?

HPWREN: A Unique Regional Capability for Public-Private Partnerships

SDSC Teaming with CALFIRE and SDG&E to Respond to and Prevent Wildfires

What Does SDSC Do?

What Can SDSC Do for You?

• Just about anything involving high-capability/high-capacity technical computing, data management, and networking

• Our technical experts are eager to engage on R&D projects and service agreements customized to meet your needs

• There is a spectrum of ways we can interact

How do you begin working with us?

• You have already taken the first step by coming here today

• Join the IPP program to learn more about SDSC expertise and resources
• POC: Ron Hawkins ([email protected])

• Enjoy the rest of the program

SDSC Data Initiatives

Chaitan Baru
Associate Director, Data Initiatives
Director, Center for Large-Scale Data Systems Research (CLDS)

SDSC, UC San Diego
[email protected]

Outline

• SDSC and Data
• Center for Large-Scale Data Systems Research (CLDS)
• Graduate Student Engagement
• Data Science Education and Training

SDSC IPP Research Review, June 12, 2013

SDSC’s Data DNA

• 25+ year history as a supercomputer center focused on data
• Applied is what we do
  ▫ At the intersection of science, data, and computational science
  ▫ Applied and applications-driven research and development
• Multidisciplinary projects and interdisciplinary collaborations are how we do it
  ▫ It is our strength and the secret sauce in our above-average success rate on highly competitive proposals
• Advancing the state of the art in science and improving the science research process is why we do it
  ▫ Lessons can be applied to business applications as well
  ▫ We believe many science applications are precursors to future business apps


Data: A rapidly evolving set of problems

• Analytics: real-time and historical trend analysis → data velocity and volume
• Integration: more comprehensive, holistic analysis → data variety
• Costs
  ▫ Hardware, energy, software, people
• Skill sets
  ▫ Need for “cross-trained”, data-savvy individuals
  ▫ Ability to thrive in multidisciplinary, holistic, data-driven environments
  ▫ Break out of narrow academic silos / corporate roles and departments
  ▫ A real shortage
• Competition
  ▫ Global talent
  ▫ Increasingly, local problems
• Privacy


SDSC R&D Activities in Data

• Informatics collaborations in
  ▫ High-energy physics, astrophysics/astronomy, computational chemistry, bioinformatics, biomedical informatics, geoinformatics, ecoinformatics, social science, neurosciences, smart energy grids, anthropology, archaeology, …
• Expertise and labs in
  ▫ Benchmarking, bioinformatics, computational science, data warehousing, data and information visualization, large graph and text data, machine learning, performance modeling, predictive analytics, scientific data management, spatial data management, workflow systems
• Centers of Excellence
  ▫ CLDS: Center for Large-scale Data Systems Research, Chaitan Baru, Director
  ▫ PACE: Predictive Analytics Center of Excellence, Natasha Balac, Director
  ▫ CAIDA: Center for Applied Internet Data Analysis, KC Claffy, Director


CLDS: Center for Large-Scale Data Systems Research

• Focus: technical and technology management aspects related to big data
• Key initiatives
  ▫ Big Data Benchmarking
  ▫ Data Value and How Much Information?
• Principals: Chaitan Baru, James Short


Big Data Benchmarking – 1

• Community activity for development of a system-level big data benchmark, like TPC
  ▫ Coordinated by SDSC, http://clds.sdsc.edu/bdbc
  ▫ [email protected]: biweekly phone meetings
• A proposed BigData Top100 List, bigdatatop100.org
• Two proposals under discussion
  ▫ BigBench: extending TPC-DS for big data
  ▫ Data Analytics Pipeline: end-to-end analysis of event stream data
• Discussions with TPC and SPEC


Big Data Benchmarking – 2

• Workshops on Big Data Benchmarking (WBDB)
  ▫ 1st WBDB: May 2012, San Jose
  ▫ 2nd WBDB: December 2012, Pune, India
  ▫ 3rd WBDB: July 2013, Xi’an, China
  ▫ 4th WBDB: October 2013, San Jose


Big Data Reference Datasets

• An initiative of the Cloud Security Alliance Big Data Working Group
  ▫ Sreeranga Rajan, Fujitsu (Chair); Neel Sundaresan, eBay (Co-Chair); Wilco van Ginkel, Verizon (Co-Chair); Arnab Roy, Fujitsu (Crypto co-lead in BDWG/CSA)
• Objective
  ▫ Make reference datasets available on one or more platforms, for …-level benchmarking
• Hosted by SDSC
  ▫ http://clds.sdsc.edu/bdbc/referencedata


NIST Big Data Working Group

• http://bigdatawg.nist.gov
• Co-chairs
  ▫ Chaitan Baru
  ▫ Robert Marcus, CTO, ET-Strategies
  ▫ Wo Chang, Chris Greer, NIST
• Objective: 1-year time frame
  ▫ Definitions
  ▫ Taxonomies
  ▫ Reference architectures
  ▫ Technology roadmap
• First meeting: June 19th
• Open to community


Current CLDS Programs

• Big Data Benchmarking (Pivotal lead)
• Project on Data Value (NetApp lead)
  ▫ Develop definitions, frameworks, assessment methodology, and tools for Data Value
  ▫ Proposed Workshop on Data Value, Jan–Feb 2014
• How Much Information 2013 (Seagate lead)
  ▫ Consumer information; enterprise information
• Data Science Institute (Brocade)
  ▫ SDSC-level program


CLDS Sponsorship

• Current sponsors
  ▫ Seagate, Pivotal, NetApp, Brocade, Intel (soon)
• Goals
  ▫ Small, focused group of core sponsors representing major, non-competitive industry quadrants (6–8 companies)
  ▫ Extended network of members who provide scale and scope and help fund industry events (20–30 companies)
• Sponsor structure
  ▫ Founding ($100K, multi-year)
  ▫ Program ($50K, annual)
• Member structure
  ▫ Continuing ($10K+, pay as you go)
  ▫ Member ($5K, per workshop event)


Big Data Benchmarking: How you can participate

• BDBC
  ▫ Join the BDBC mailing list: ~150 members, ~75 organizations
  ▫ Attend biweekly meetings, every other Thursday
  ▫ Present at biweekly meetings
• WBDB
  ▫ Submit papers to workshops; attend workshops
• Reference Datasets
  ▫ Participate in the Reference Datasets activity; contribute reference datasets
• NIST Big Data Working Group
  ▫ Join and contribute to the NIST Big Data Working Group
• Join CLDS as a sponsor

Data Value; How Much Information? How you can participate

• Data Value
  ▫ Join workshop planning and organization
  ▫ Contribute use cases
  ▫ Join CLDS as a sponsor
• HMI?
  ▫ Contribute use cases
  ▫ Join CLDS as a sponsor

SDSC Data Science Institute (DSI)

• Objective: provide training and education in data science
• Audience: industry attendees and academic researchers dealing with data
• Format
  ▫ Coverage of end-to-end issues in data science
  ▫ Emphasis on hands-on learning using short course formats, e.g., 1-day, 2-day, 1-week, and up to 1-month
  ▫ Inclusion of modules taught by industry
  ▫ At-Home and On-the-Road programs
  ▫ Possible internships associated with DSI
• First offering: SDSC Summer Institute, “Discovering Big Data”, Aug 5–12, 2013


DSI: How you can participate

• Naming opportunity
  ▫ The Data Science Institute
• Sign up for the SDSC Summer Institute
  ▫ $2K for 3 days or 5 days
• Sign up for future DSI offerings
• On-the-Road program
  ▫ Work with us to create an on-the-road program for your company, or your customers
• Contribute training modules
  ▫ Contribute modules based on your technology
• Provide your case studies for use in DSI


SDSC Graduate Projects Program

• SDSC projects for CSE MS graduate students
  ▫ Students work on projects with vendor hardware/software
  ▫ Upon successful completion, students receive internships at companies
  ▫ Companies have the option to hire students permanently
• A testbed/sandbox for big data / data science / computational science
  ▫ Currently have a 32-node Hadoop cluster with Hortonworks HDP
  ▫ Plan to also install Intel Hadoop; would like to extend to 96 nodes
  ▫ Discussing a project with Brocade to test performance of one of their Ethernet switches
• Two students just completed their MS projects (presentations today!)
  ▫ Joining Google and Zynga
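The kind of job such a Hadoop testbed runs can be sketched in miniature as a pure-Python map/shuffle/reduce word count. No Hadoop API is involved and the input lines are invented; this only illustrates the programming model students work with.

```python
# Map/shuffle/reduce word count: the MapReduce programming model in
# plain Python, without any Hadoop machinery.
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map step: emit (key, 1) for each word.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle step: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce step: sum the counts for one key.
    return key, sum(values)

lines = ["big data big compute", "data science"]
pairs = chain.from_iterable(mapper(l) for l in lines)
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'compute': 1, 'science': 1}
```

On a real cluster the map and reduce steps run in parallel across nodes; the structure of the program is the same.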


Graduate Projects Program: How you can participate

• Create a project
  ▫ Announce a project and offer an internship after successful completion
• Contribute hardware/software for projects
• Contribute data with application scenarios / use cases

Bioinformatics Meets Big Data

Wayne Pfeiffer SDSC/UCSD June 12, 2013

Questions for today

• What is causing the flood of data in bioinformatics?
• How much data are we talking about?
• What bioinformatics codes are installed at SDSC?
• What are typical compute- and data-intensive analyses in bioinformatics?
• What are their computational challenges?

Cost of DNA sequencing has dropped much faster than cost of computing in recent years, producing the flood of data

Size matters: how much data are we talking about?

• 3.1 GB for human genome
  ▫ Fits on a flash drive; assumes FASTA format (1 B per base)
• >100 GB/day from a single Illumina HiSeq 2000
  ▫ 50 Gbases/day of reads in FASTQ format (2.5 B per base)
• 300 GB to 1 TB of reads needed as input for analysis of a whole human genome, depending upon coverage
  ▫ 300 GB for 40x coverage; 1 TB for 130x coverage
• Multiple TB needed for subsequent analysis
  ▫ 45 TB on disk at SDSC for the W115 project! (~10,000x a single genome)
• Multiple genomes per person!
• May only be looking for kB or MB in the end
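The sizes above follow directly from the per-base storage costs; a quick back-of-envelope calculation reproduces them (the 3.1 Gbases and bytes-per-base figures are the ones on this slide):

```python
# Back-of-envelope genome data volumes, using the slide's numbers:
# 3.1 Gbases per human genome, FASTA at 1 B/base, FASTQ at ~2.5 B/base
# (sequence plus quality scores and headers).
GENOME_BASES = 3.1e9
FASTA_BYTES_PER_BASE = 1.0
FASTQ_BYTES_PER_BASE = 2.5

def fasta_size_gb(bases=GENOME_BASES):
    """Reference genome stored as plain FASTA."""
    return bases * FASTA_BYTES_PER_BASE / 1e9

def reads_size_gb(coverage, bases=GENOME_BASES):
    """Raw reads in FASTQ at a given depth of coverage."""
    return coverage * bases * FASTQ_BYTES_PER_BASE / 1e9

print(fasta_size_gb())     # 3.1   (GB, matches "fits on a flash drive")
print(reads_size_gb(40))   # 310.0 (GB, the slide's "300 GB for 40x")
print(reads_size_gb(130))  # 1007.5 (GB, the slide's "1 TB for 130x")
```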

SDSC has a rich set of bioinformatics software; representative codes are listed here
• Pairwise sequence alignment: ATAC, BFAST, BLAST, BLAT, Bowtie, BWA
• Multiple sequence alignment (via CIPRES gateway): ClustalW, MAFFT
• RNA-Seq analysis: GSNAP, TopHat
• De novo assembly: ABySS, Edena, SOAPdenovo, Velvet
• Phylogenetic tree inference (via CIPRES gateway): BEAST with BEAGLE, GARLI, MrBayes, RAxML, RAxML-Light
• Tool kits: BEDTools, GATK, SAMtools

Many bioinformatics projects use SDSC supercomputers

• HuTS: Human Tumor Study (STSI/SDSC)
  ▫ Find mutations in tumor, and select appropriate chemotherapy
• W115: study of somatic mutations in the genome of a 115-year-old woman (VU Amsterdam, et al.)
  ▫ Find somatic mutations in white blood cells
• MRSA (STSI/UCSD)
  ▫ Characterize genomes of MRSA found in local hospitals
• Larry Smarr’s microbiome (UCSD)
  ▫ Analyze Larry Smarr’s microbiome
• Various phylogenetics studies (via CIPRES gateway)
  ▫ Calculate phylogenetic trees

Computational workflow for read mapping & variant calling

Goal: identify simple variants, e.g.,
• single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs)
• short insertions & deletions (indels)

Workflow: DNA reads in FASTQ format → read mapping, i.e., pairwise alignment against a reference genome in FASTA format (BFAST, BWA, …) → alignment info in BAM format → variant calling (GATK, …) → variants: SNPs, indels, others

[Diagram: a read aligned to the reference with one mismatched base illustrates a SNP; a gapped alignment illustrates an indel.]
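The variant-calling step above can be illustrated with a toy version: compare reads that are already aligned to a reference and report mismatches as candidate SNVs. Real pipelines use BWA for the alignment and GATK for the calling; the function and sequences below are a made-up miniature of the idea.

```python
# Toy SNV caller: for a gapless alignment of a read against the
# reference, report each mismatch as (position, ref_base, read_base).
# A '-' in the read marks a deleted base and is skipped here.
def call_snvs(reference: str, read: str, offset: int = 0):
    variants = []
    for i, base in enumerate(read):
        ref_base = reference[offset + i]
        if base != "-" and base != ref_base:
            variants.append((offset + i, ref_base, base))
    return variants

# Sequences echoing the slide's diagram: one T->A mismatch.
reference = "CACCGGCGCAGTCATTCTCATAAT"
read      = "CACCGGCGCAGACATTCTCATAAT"
print(call_snvs(reference, read))  # [(11, 'T', 'A')]
```

Production callers additionally weigh base quality, read depth, and mapping quality before reporting a variant; that is what makes the step compute-intensive at whole-genome scale.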

Pileup diagram shows mapping of reads to reference; example from HuTS shows a SNP in KRAS gene; this means that cetuximab is not effective for chemotherapy

BWA analysis by Sam Levy, STSI; diagram from Andrew Carson, STSI

The CIPRES gateway lets biologists run phylogenetics codes at SDSC via a browser interface; http://www.phylo.org/index.php/portal

Computational challenges abound in bioinformatics

• Large amounts of data, which can grow substantially during analysis • Complex workflows, often with different computational requirements along the way • Parallelism that varies between steps in the workflow pipeline • Large shared memory needed for some analyses

Distributed Workflow-Driven Analysis of Biological Big Data

Ilkay Altintas, Ph.D.
Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD
Lab Director, Scientific Workflow Automation Technologies
[email protected]


So, what is a scientific workflow?

Scientific workflows emerged as an answer to the need to combine multiple Cyberinfrastructure components in automated process networks.
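The idea of combining components into an automated process network can be shown in miniature: steps declare their inputs, and an engine runs them in dependency order while recording provenance. This is a generic sketch of the concept, not Kepler's actual API; the step names are invented.

```python
# A scientific workflow in miniature: a DAG of steps executed in
# dependency order, with the execution order kept as provenance.
from graphlib import TopologicalSorter

def run_workflow(steps, dependencies):
    """steps: name -> callable(results); dependencies: name -> set of upstream names."""
    results, provenance = {}, []
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = steps[name](results)
        provenance.append(name)  # record order for reproducibility
    return results, provenance

steps = {
    "fetch":  lambda r: [3, 1, 2],                 # stand-in for data acquisition
    "sort":   lambda r: sorted(r["fetch"]),        # one analysis branch
    "sum":    lambda r: sum(r["fetch"]),           # another analysis branch
    "report": lambda r: (r["sort"], r["sum"]),     # combine results
}
deps = {"fetch": set(), "sort": {"fetch"}, "sum": {"fetch"}, "report": {"sort", "sum"}}

results, order = run_workflow(steps, deps)
print(results["report"])  # ([1, 2, 3], 6)
```

Systems like Kepler add what this sketch omits: distributed execution, streaming, fault tolerance, and full data provenance.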

The Big Picture is Supporting the Scientist

From “Napkin Drawings” to Executable Workflows

[Figure: a conceptual workflow sketch (Circonspect → Combine Results → PHACCS, starting from a Fasta file and producing an average genome size) is refined from a conceptual SWF into an executable SWF.]

Scientific Workflow Automation Technologies @ SDSC

• Housed in San Diego Supercomputer Center at UCSD since 2004

• Mission: Support CI projects, scientists and engineers for computational practices involving process management

• Research and development focus
  ▫ Scientific workflow management
  ▫ Data and process provenance
  ▫ Distributed execution using scientific workflows
  ▫ Engineering and streaming workflows for environmental observatories
  ▫ Fault tolerance in scientific workflows
  ▫ Sensor network management and monitoring
  ▫ Role of scientific workflows in eScience infrastructures
  ▫ Understanding collaborative work in workflow-driven eScience

• Scientific collaborations: Bioinformatics, Environmental Observatories, Oceanography, Computational Chemistry, Fusion, Geoinformatics, …

Workflows are Used as Toolboxes in Biological Sciences

– Assemble complex processing easily
– Access diverse resources transparently
– Incorporate multiple software tools
– Assure reproducibility
– Community development model

[Diagram: the data lifecycle — acquisition, generation, analysis, publication, archival.]

Workflows foster collaborations!
• Flexibility and synergy
• Optimization of resources
• Increasing reuse
• Standards compliance

Need expertise to identify which tool to use when and how! Require computation models to schedule and optimize execution!

Kepler is a Scientific Workflow System

www.kepler-project.org

• An open collaboration, initiated August 2003
• Builds upon the open-source Ptolemy II framework
• Kepler 2.4 released 04/2013

Ptolemy II: a laboratory for investigating design
KEPLER: a problem-solving environment for scientific workflow
KEPLER = “Ptolemy II + X” for scientific workflows

CAMERA Example:

Using Scientific Workflows and Related Provenance for Collaborative Metagenomics Research

Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA) http://camera.calit2.net

CAMERA is a Collaborative Environment

• Data Cart: mixed collections of CAMERA data (e.g., projects, samples)
• User Workspace: single workspace with access to all data and results (private and shared)
• Data Discovery: GIS and advanced query options
• Group Workspace: share specified data with collaborators
• User Data Analysis Workspace: workflow-based analysis

Workflows are a Central Part of CAMERA

More than 1500 workflow submissions monthly!

Inputs: from local or CAMERA file systems; user-supplied parameters Outputs: sharable with a group of users and links to the semantic database

All can be reached through the CAMERA portal at http://portal.camera.calit2.net

Pushing the boundaries of existing infrastructure and workflow system capabilities!
• Increase reuse
• Increase programmability by end users
• Increase resource utilization
• Make analysis a part of the end-to-end scientific model from data generation to publication

Add to these large amounts of next generation sequencing data!

bioKepler is a coordinated ecosystem of biological and technological packages!
www.kepler-project.org | www.bioKepler.org

[Diagram: bioKepler layered on the Kepler and Provenance Framework over cyberinfrastructure platforms, combining BIO packages (BioLinux, Galaxy) and COMPUTE packages (Hadoop, Stratosphere).]

• Development of a comprehensive bioinformatics scientific workflow module for distributed analysis of large-scale biological data

Improvement on usability and programmability by end users!

What can we do for you?

Access to technology and biology packages:
• Bio-Linux
• Galaxy (via Amazon)
• Amazon Cloud
• Hadoop
• Stratosphere

Partners:
• Individual researchers and labs
• Research projects, e.g., CAMERA
• Academic institutions
• Private labs, e.g., JCVI

Services:
• Training
• Workflow-driven data- and compute-intensive process
• Consulting: designing, scaling and tracking pipelines and workflows
• Development: production workflows for you

JCVI selection criteria: programmability, modularity, customizability, scalability

[Figure: MapReduce BLAST] www.bioKepler.org

Thanks! & Questions…

Ilkay Altintas [email protected]

How to download Kepler? https://kepler-project.org/users/downloads
Please start with the short Getting Started Guide: https://kepler-project.org/users/documentation

Building a Semantic Information Infrastructure

Research Review: SDSC Industrial Partnership Program

Amarnath Gupta

Advanced Query Processing Lab San Diego Supercomputer Center, University of California at San Diego

Lots of Data, Little Glue

A Science Enterprise:
• Lab resources
• Experiment design
• Reagent catalogs
• Experimental data
• Chains of derived data
• Analysis results (incl. outputs of software tools)
• External information
• Publications and presentations
• …

An Industrial Enterprise:
• Customer data
• Product specifications
• Product sales data
• Production process data
• Customer call records
• Internal memos
• Legal documents
• Emails
• Social intranet
• …

Wide variation in data types, models, volume, systems, usage, updates, … What binds them together?

Some Consequences of Not Having a Glue

• An orphan disease researcher
  ▫ Spends five months to determine that her disease of interest relates to the mouse version of a gene studied in breast cancer
• A health institution
  ▫ Takes over a year to integrate clinical trial management data, drug characteristics, and patient reports to get a 360-degree view of drug effects
• A reagent vendor
  ▫ Spends significant time and money to determine the effectiveness of a product line
• A legal department
  ▫ Takes much longer to discover relevant documents and data for a complex, multi-party litigation
• A utilities company
  ▫ Cannot easily perform synchrophasor analytics for grid-behavior understanding because SCADA, EMS, and PMU data cannot be integrated

Integration through Semantics

We have been building NIF, a semantic information infrastructure for neuroscience.

In NIF:
• Data can be relational, XML, RDF, OWL, wiki content, manuscripts, publications, blogs, multimedia, annotations, …
• Any new domain goes through semantic processing
• Every piece of data gets some semantic markup
• Data are integrated and searched using ontological indices
• Any search or discovery process is interpreted through an ontology
• Any workflow that performs a mining-style computation utilizes semantic properties

What NIF is designed to answer — some unanticipated questions:
• Are NIH studies gender-biased? Which?
• Is the CRE mouse line being used by anyone?
• Which diseases have data but have been underfunded?
• Is the method section in this paper under-specified?

Technology Research

• Semantic Prospecting for any industry and application domain
• Semi-automatic Construction of Semantic Models for any problem domain
• Developing general-purpose, yet domain-specific Semantic Search Engines for heterogeneous data
• Complex Information Discovery using semantic graph analysis and mining-style computation
• Multi-domain Information Integration using semantic information bridges

Why SDSC?
• Semantic processing can be complex and costly
  ▫ Domain model construction (incl. active learning)
  ▫ Semantic indexing and correlation indexing
  ▫ Graph query processing
  ▫ Semantic search
  ▫ Landscape and discovery analytics
• The SDSC infrastructure handles complexity at scale
  ▫ Configurable compute nodes
  ▫ Large-memory systems
  ▫ SSD drives

Let SDSC help your infrastructure, research and service needs

Biomedical data integration system and web search engine

Julia Ponomarenko, PhD Michael Baitaluk, PhD

San Diego Supercomputer Center

[Diagram: the landscape of biomedical data types — sequences, networks, variations, publications, annotations, taxonomies, epigenetic data, structures, expression data, biochemical data.]

Databases and web pages (2,000+)

How can a researcher embrace such an amount of data in its entirety?


Data Integration Resources


Data Integration Resources

This leaves a researcher to work with partial, incomplete, non-comprehensive data sets!

[The same data-type diagram, now annotated with a total volume of ~0.1 PB.]

Biological ontologies + Semantic Web technologies

Data Warehouse

Database web pages | Other web pages

[A sequence of slides steps through crawling and indexing database web pages and other web pages.]

• For each ontological term A and page X, calculate the relevance score of X to A
• Automatically extract the data and map them into the internal database schema

User Community

Web-portal & API: integromeDB.org | Java application: BiologicalNetworks.org

IntegromeDB

Public Data on the Web | User’s Private Data
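The relevance-scoring step mentioned above (score of a page X against an ontological term A) might look, in its simplest form, like a term-frequency measure. The slides do not specify IntegromeDB's actual scoring function; the function, page text, and synonym handling below are illustrative stand-ins.

```python
# Toy relevance score of a page X to an ontological term A: the
# fraction of words in the page matching the term or one of its
# (hypothetical) synonyms. A real system would also use semantic
# markup, link structure, and ontology distance.
import re

def relevance(page_text: str, term: str, synonyms=()) -> float:
    words = re.findall(r"[a-z0-9]+", page_text.lower())
    if not words:
        return 0.0
    vocab = {term.lower(), *(s.lower() for s in synonyms)}
    hits = sum(1 for w in words if w in vocab)
    return hits / len(words)

page = "KRAS mutations in tumor samples; KRAS is an oncogene."
print(relevance(page, "KRAS"))  # 2 hits out of 9 words ≈ 0.222
```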

integromedb.org


Integromedb.org visit statistics


Your User Community

Web-portal & APIs | Applications

Your Database

Public Data on the Web | User’s Private Data

Your User Community

Your Google-like “Search-in-the-Box” Appliance

PACE: Predictive Analytics and Data Mining Research and Applications

Natasha Balac, Ph.D.
Director, PACE (Predictive Analytics Center of Excellence) @ San Diego Supercomputer Center, UCSD

PACE: Closing the gap between Government, Industry and Academia

PACE is a non-profit, public educational organization
o To promote, educate and innovate in the area of predictive analytics
o To leverage predictive analytics to improve the education and well-being of the global population and economy
o To develop and promote a new, multi-level curriculum to broaden participation in the field of predictive analytics

[Diagram: PACE activities — Inform, Educate and Train; Bridge the Industry and Academia Gap; Develop Standards and Methodology; Provide Predictive Analytics Services; High-Performance Scalable Data Mining; Repository of Very Large Data Sets; Foster Research and Collaboration.]

Foster Research and Collaboration

• Fraud detection
• Modeling user behaviors
• Smart Grid analytics
• Solar-powered system modeling
• Microgrid anomaly detection
• Distributed energy generation
• Manufacturing
• Sport analytics
• Genomics

UCSD Smart Grid

• UCSD Smart Grid sensor network data set
• 45 MW peak microgrid; daily population of over 54,000 people
• Self-generates 92% of its own annual electricity load
• Smart Grid data: over 100,000 measurements/sec
• Sensor and environmental/weather data
• Large amount of multivariate and heterogeneous data streaming from complex sensor networks
• Predictive analytics throughout the microgrid
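One common way to screen sensor streams like these is a rolling z-score: flag a reading that is far outside the recent window's distribution. This is a generic sketch, not the method PACE uses on the UCSD microgrid, and the readings are made up.

```python
# Rolling z-score anomaly detector for a single sensor stream.
from collections import deque
import math

class RollingAnomalyDetector:
    def __init__(self, window=50, threshold=4.0):
        self.window = deque(maxlen=window)  # recent readings only
        self.threshold = threshold          # flag points this many std devs out

    def update(self, x: float) -> bool:
        """Return True if x is anomalous relative to the recent window."""
        anomalous = False
        if len(self.window) >= 10:          # wait for enough history
            n = len(self.window)
            mean = sum(self.window) / n
            std = math.sqrt(sum((v - mean) ** 2 for v in self.window) / n)
            if std > 0 and abs(x - mean) / std > self.threshold:
                anomalous = True
        self.window.append(x)
        return anomalous

det = RollingAnomalyDetector()
stream = [10.0 + 0.1 * math.sin(i) for i in range(100)] + [25.0]
flags = [det.update(x) for x in stream]
print(flags[-1])  # True: the 25.0 spike stands out from the ~10.0 baseline
```

At 100,000 measurements/sec this per-point update is cheap enough to run online; the hard part at microgrid scale is doing it across tens of thousands of correlated streams at once.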

4 V’s of Big Data

IBM, 2012

What to do with big data? Eric Sall (IBM)’s list
• Big Data Exploration
  ▫ To get an overall understanding of what is there
• 360-degree view of the customer
  ▫ Combine both internally available and external information to gain a deeper understanding of the customer
• Monitoring cyber-security and fraud in real time
• Operational analysis
  ▫ Leveraging machine-generated data to improve business effectiveness
• Data warehouse augmentation
  ▫ Enhancing warehouse solutions with new information models and architecture

Big Data – Big Training

• “Data Scientist”: the “hot new gig in town” (O’Reilly report)
• “Data Scientist: The Sexiest Job of the 21st Century” (Harvard Business Review, October 2012)
  ▫ The future belongs to the companies and people that turn data into products
• Article in Fortune: “The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist”

Data Science Job Growth

By 2018, a shortage of 140,000–190,000 predictive analysts and 1.5M managers/analysts in the US

PACE Education
• Data mining Boot Camps
  ▫ Boot Camp 1: September 12-13, 2013
  ▫ Boot Camp 2: October 17-18, 2013
  ▫ On-site personalized Boot Camp (10-15; 20-30)
• Tech Talks: every 3rd Wednesday
• Workshops, webinars
• Interesting Reads, “Tool-off”
• “Bring your own data”


Predictive Analytics Consulting

• Full-service consulting and development services
• Targeted projects with industry and agency partners
• Applied and applications-oriented research
• Technical expertise and industry experience

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Questions?
(Figure source: Mike Gualtieri's blog)

• http://pace.sdsc.edu/

• For further information, contact Natasha Balac [email protected]

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Performance Modeling and Characterization (PMaC) Lab

Laura Carrington, Ph.D., PMaC Lab Director

University of California, San Diego San Diego Supercomputer Center

The PMaC Lab

Mission Statement: Research the complex interactions between HPC systems and applications to predict and understand the factors that affect performance and power on current and projected HPC platforms.

• Develop tools and techniques to deconstruct HPC systems and HPC applications to provide detailed characterizations of their power and performance.

The PMaC Lab utilizes the characterizations to construct power and performance models that guide:
• Improvement of application performance: 2 Gordon Bell finalists, DoD HPCMP applications, NSF Blue Waters, etc.
• System procurement, acceptance, and installation: DoD HPCMP procurement team, DoE upgrade of ORNL Jaguar, installation of NAVO PWR6
• Accelerator assessment for a workload: performance assessment and prediction for GPUs & FPGAs
• Hardware customization / hardware-software co-design: performance and power analysis for exascale
• Improvement of energy efficiency and resiliency: Green Queue project, DoE SUPER Institute, DoE BSM, etc.

Automated tools & techniques to characterize HPC systems & applications

• HPC System: characterize the computational (& communication) patterns that affect the overall power draw
• HPC Application: characterize the computational (& communication) behavior of the application

(Diagram: an application decomposed into code regions: Loop #1, Loop #2, Loop #3, Func. Foo.)

Design software- and hardware-aware energy and performance optimization techniques

Hardware Customization

• 10x10 project: analysis of 37 workloads
  • PI: A. Chien @ U. of Chicago
  • Heterogeneous processor architecture
• Reconfigurable memory hierarchy (L1, L2, and L3): ~70% energy savings
• Selection of energy-optimal configuration via simulation and reuse distance

Space name   # Configurations in search space   # Unique configurations selected   Avg. energy savings (%)
Full         2652                               33                                 68.8
2N           469                                26                                 66.0
Restricted   224                                23                                 65.5
Cluster      10                                 10                                 63.7
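The reuse distance mentioned above can be made concrete. A minimal sketch of the computation, assuming a fully associative LRU cache model (the trace, helper name, and numbers are illustrative, not PMaC's actual tooling):

```python
def reuse_distances(trace):
    """For each access, count the distinct addresses touched since its
    last use. Returns -1 (a 'cold miss') for first-time accesses."""
    last_seen = {}          # address -> index of most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # distinct addresses accessed strictly between the two uses
            window = set(trace[last_seen[addr] + 1 : i])
            distances.append(len(window))
        else:
            distances.append(-1)
        last_seen[addr] = i
    return distances

# An access with reuse distance < cache capacity (in lines) hits in a
# fully associative LRU cache; larger distances miss.
trace = ["A", "B", "C", "A", "B", "D", "A"]
print(reuse_distances(trace))  # [-1, -1, -1, 2, 2, -1, 2]
```

Given a reuse-distance histogram for an application, one can estimate the hit rate of any candidate cache size without re-running the program, which is what makes the technique attractive for searching a configuration space.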

PMaC's Green Queue Framework (optimizing for performance & power)
Goal: Use machine and application characterization to make application-aware energy optimizations during execution.
(Plot: power (kW) vs. time (seconds) for applications on 1,024 Gordon cores; measured energy savings of 4.8%, 5.3%, 6.5%, 6.5%, 19%, 21%, and 32%.)

Questions

Benchmarking and Tuning Big Data Software

David Nadeau, Ph.D.

SAN DIEGO SUPERCOMPUTER CENTER

What to tune?
• Application benchmarking/profiling finds what to tune for one system

• But... we want answers for many systems for now and (hopefully) years to come

• Requires understanding fundamental trends • Processor speeds, core counts, memory bandwidth, memory latency, network bandwidth, etc.

SAN DIEGO SUPERCOMPUTER CENTER A few trends

• Clock speeds: almost flat
• Cycles/math op: almost flat

Raw math ability/core is not improving.

SAN DIEGO SUPERCOMPUTER CENTER A few trends

• SPEC float/CPU: way up
• Core count/CPU: way up
• Float math/core (thread): up 8 to 20%/year

Why is this trend upward if math speed is flat?

SAN DIEGO SUPERCOMPUTER CENTER A few trends

• DDR bandwidth: up ~15%/year
• Memory bandwidth/core: up 0 to 12%/year

Improvement is primarily from better memory bandwidth, not math ability.

SAN DIEGO SUPERCOMPUTER CENTER

What this means
• SPEC, Dhrystone, etc. are now dominated by memory performance, not math:
  • 1 multiply = 1 cycle
  • 1 memory access = ~300 cycles
  • Not likely to change soon

• Application performance is dominated by memory access costs.

• So tune access patterns or data order.
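A back-of-envelope model using the cycle counts from the slides shows why access patterns dominate. The kernel shape and miss ratio below are hypothetical:

```python
CYCLES_PER_MULTIPLY = 1       # figures from the slide
CYCLES_PER_DRAM_ACCESS = 300

def kernel_cycles(multiplies, memory_accesses):
    """Crude cost model: math cycles plus DRAM-access cycles."""
    return (multiplies * CYCLES_PER_MULTIPLY
            + memory_accesses * CYCLES_PER_DRAM_ACCESS)

# A kernel doing 1000 multiplies, one DRAM access per multiply
naive = kernel_cycles(1000, 1000)     # 301,000 cycles, almost all memory
# Same math, but data reordered so only 1 in 10 accesses goes to DRAM
blocked = kernel_cycles(1000, 100)    # 31,000 cycles
print(naive / blocked)                # ~9.7x faster
```

Cutting memory traffic by 10x speeds the kernel up nearly 10x, while eliminating the multiplies entirely would barely register.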

SAN DIEGO SUPERCOMPUTER CENTER

Example: 3D volume
Task: Store a 3D volume in memory. Options: array of arrays of arrays, one big array, etc.

Worst: array of array of arrays. Best: one big array with simple 3D indexing. Fewer memory references is much faster.
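A sketch of the two layouts (dimensions and helper name are illustrative; in compiled code the flat layout wins because one address computation replaces three dependent pointer dereferences):

```python
nx, ny, nz = 32, 32, 32

# Worst case from the slide: array of arrays of arrays
# (each [] is a separate memory reference)
nested = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]

# Best case: one flat array with explicit 3D indexing
flat = [0.0] * (nx * ny * nz)

def idx(i, j, k):
    # row-major linearization: one indexed access instead of three
    return (i * ny + j) * nz + k

nested[3][4][5] = 7.0
flat[idx(3, 4, 5)] = 7.0
assert nested[3][4][5] == flat[idx(3, 4, 5)]
```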

SAN DIEGO SUPERCOMPUTER CENTER

Example: 3D volume sweep
Task: Sweep a plane through the volume. 6 axis directions: +X, -X, +Y, -Y, +Z, -Z

4 sweep directions are slow. 2 are 10x to 30x faster.

Sweep in natural data order is much faster.
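The stride argument behind this can be made explicit. With row-major (i, j, k) indexing, only one sweep axis touches memory contiguously (dimensions below are illustrative):

```python
nx, ny, nz = 4, 4, 4

def idx(i, j, k):
    # row-major linearization of a 3D volume
    return (i * ny + j) * nz + k

# Stride between consecutive accesses when the innermost loop varies one axis
stride_z = idx(0, 0, 1) - idx(0, 0, 0)   # 1: natural data order, cache friendly
stride_y = idx(0, 1, 0) - idx(0, 0, 0)   # nz: skips a row per access
stride_x = idx(1, 0, 0) - idx(0, 0, 0)   # ny * nz: jumps across memory
print(stride_z, stride_y, stride_x)  # 1 4 16
```

For realistic volume sizes the X stride spans thousands of elements, so every access in a +X or -X sweep lands on a different cache line, matching the slide's 10x to 30x gap between sweep directions.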

SAN DIEGO SUPERCOMPUTER CENTER

Example: 3D volume bricking
Task: Sweep in any direction. Solution: brick it (a cube of cubes).

(Plot: sweep performance for unbricked vs. 2x2x2, 4x4x4, 8x8x8, and 16x16x16 bricks.) Bricking makes all sweep directions have similar performance.

Bricked data order is more cache friendly.
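A sketch of bricked (cube-of-cubes) addressing, assuming a hypothetical brick edge B that divides the volume dimensions evenly:

```python
B = 4                    # brick edge, e.g. the 4x4x4 case from the slide
nx = ny = nz = 16
bx, by, bz = nx // B, ny // B, nz // B   # bricks per axis

def brick_idx(i, j, k):
    """Index into a bricked layout: locate the brick, then the cell
    within it. Bricks are stored contiguously, so a sweep in any of
    the six directions streams through whole bricks at a time."""
    Bi, Bj, Bk = i // B, j // B, k // B      # which brick
    oi, oj, ok = i % B, j % B, k % B         # offset inside the brick
    brick = (Bi * by + Bj) * bz + Bk
    return brick * B * B * B + (oi * B + oj) * B + ok

# Sanity check: every cell maps to a unique slot
slots = {brick_idx(i, j, k)
         for i in range(nx) for j in range(ny) for k in range(nz)}
assert len(slots) == nx * ny * nz
```

The design trade-off is a slightly costlier index computation in exchange for cache behavior that no longer depends on sweep direction.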

SAN DIEGO SUPERCOMPUTER CENTER

Example: Desktop compression
Task: Compress and send the desktop in real time. Many codecs.

Clever codecs are slower, despite producing smaller result data.

Codecs with fewer memory references are faster.

SAN DIEGO SUPERCOMPUTER CENTER

Example: Parallel compositing
Task: Composite N images on a cluster of N nodes. Many algorithms.

(Plot comparing many small messages vs. few big messages; the same total amount of data moves either way.)

Better net use is faster.
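One way to see the trade-off is to compare message counts. The slide does not name the algorithms measured, so as an illustration, compare a naive direct-send composite against binary swap, a standard compositing algorithm:

```python
import math

def direct_send_messages(n):
    # every node sends its full image to every other node
    return n * (n - 1)

def binary_swap_messages(n):
    # log2(n) rounds; each node exchanges one (halving) message per round
    assert n & (n - 1) == 0, "binary swap assumes a power-of-two node count"
    return n * int(math.log2(n))

print(direct_send_messages(64), binary_swap_messages(64))  # 4032 384
```

Both schemes move the same pixel data in total, but an order-of-magnitude difference in message count means an order of magnitude less per-message latency and network contention.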

SAN DIEGO SUPERCOMPUTER CENTER And so on…

SAN DIEGO SUPERCOMPUTER CENTER

Gordon: A First-of-its-Kind Data-Intensive Supercomputer

SDSC Research Review June 5, 2013

Shawn Strande Gordon Project Manager

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Gordon is a highly flexible system for exploring a wide range of data-intensive technologies and applications:
• High performance flash technology
• Massively large memory environments
• High performance parallel file system

• Scientific databases
• On-demand Hadoop and data-intensive environments
• Complex application architectures
• New algorithms and optimizations
• High speed InfiniBand interconnect

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Gordon is a data movement machine

Sandy Bridge Compute Nodes (1,024)
• 64 TB memory
• 341 Tflop/s

Flash-based I/O Nodes (64)
• 300 TB Intel eMLC flash
• 35M IOPS

Large Memory Nodes
• vSMP Foundation 5.0
• 2 TB of cache-coherent memory per node

Dual-rail, 3D Torus Interconnect
• 7 GB/s

"Data Oasis" Lustre PFS
• 100 GB/sec, 4 PB

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

SSD latencies are 2 orders of magnitude lower than HDDs' (that's a big deal for some data-intensive applications)

          Typical hard drive   Solid state disk
Latency   ~10 ms (0.010 s)     ~100 µs (0.0001 s)
IOPS      200                  35,000 / 3,000 (R/W)
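The table's figures translate directly into run-time estimates for random-access workloads. The one-million-read workload below is hypothetical:

```python
HDD_IOPS = 200            # figures from the table above
SSD_READ_IOPS = 35_000

random_reads = 1_000_000  # hypothetical random-read workload

hdd_time = random_reads / HDD_IOPS        # 5,000 s (~83 minutes)
ssd_time = random_reads / SSD_READ_IOPS   # ~29 s
print(hdd_time / ssd_time)                # ~175x faster on flash
```

This is exactly the regime where database scans, index lookups, and out-of-core algorithms live, which is why the flash I/O nodes matter for the applications on the following slides.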

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Protein Data Bank (flash-based I/O node)

The RCSB Protein Data Bank (PDB) is the leading primary database that provides access to the experimentally determined structures of proteins, nucleic acids and complex assemblies. In order to allow users to quickly identify more distant 3D relationships, the PDB provides a pre-calculated set of all possible pairwise 3D protein structure alignments.

Although the pairwise structure comparisons are computationally intensive, the bottleneck is the centralized server that is responsible for assigning work, collecting results and updating the MySQL database.

Using a dedicated Gordon I/O node and the associated 16 compute nodes, the work could be accomplished 4-6x faster than on the Open Science Grid (OSG).

Configuration      Time for 15M alignments   Speedup
Reference (OSG)    24 hours                  1.0
Lyndonville        6.3 hours                 3.8
Taylorsville       4.1 hours                 5.8

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

OpenTopography Facility (flash-based I/O node)
The NSF-funded OpenTopography Facility provides online access to Earth science-oriented high-resolution LIDAR topography data, along with online processing tools and derivative products. Point cloud data are processed to produce digital elevation models (DEMs): 3D representations of the landscape.

The local binning algorithm utilizes the elevation information from only the points inside a circular search area with a user-specified radius. An out-of-core (memory) version of the local binning algorithm exploits secondary storage to save intermediate results when the size of a grid exceeds that of memory.
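An in-core sketch of the local binning idea (the point data, node locations, and function name are illustrative; OpenTopography's production version is out-of-core and far more efficient):

```python
import math

def local_binning(points, grid_nodes, radius):
    """points: (x, y, z) LIDAR returns; grid_nodes: (x, y) DEM node centers.
    Returns the mean elevation of the points inside each node's circular
    search area (None where no points fall within the radius)."""
    dem = []
    for gx, gy in grid_nodes:
        zs = [z for x, y, z in points
              if math.hypot(x - gx, y - gy) <= radius]
        dem.append(sum(zs) / len(zs) if zs else None)
    return dem

points = [(0.1, 0.1, 5.0), (0.2, 0.0, 7.0), (3.0, 3.0, 9.0)]
print(local_binning(points, [(0.0, 0.0), (3.0, 3.0)], radius=0.5))
# [6.0, 9.0]
```

The out-of-core version keeps only a window of the output grid in memory and spills partial sums to disk, which is where fast SSDs deliver the 20x speedup reported below.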

Using a dedicated Gordon I/O node with fast SSD drives reduces the run times of massive concurrent out-of-core processing jobs by a factor of 20x.

(Figures: a high-resolution bare-earth DEM of the San Andreas fault south of San Francisco, generated using OpenTopography LIDAR processing tools, and an illustration of local binning geometry: dots are LIDAR shots; '+' marks the DEM nodes at which elevation is estimated. Source: C. Crosby, UNAVCO)

Dataset and processing configuration: Lake Tahoe, 208 million LIDAR returns, 0.2-m grid resolution and 0.2-m radius

# concurrent jobs   OT servers   Gordon ION   Speedup
1                   3,297 sec    1,102 sec    3x
4                   29,607 sec   1,449 sec    20x

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

IntegromeDB (flash-based I/O node)

The IntegromeDB is a large-scale data integration system and biomedical search engine. IntegromeDB collects and organizes heterogeneous data from over a thousand databases covered by the Nucleic Acids Research database collection and millions of public biomedical, biochemical, drug- and disease-related resources.

IntegromeDB is a distributed system stored in a PostgreSQL database containing over 5,000 tables, 500 billion rows and 50TB of data. New content is acquired using a modified version of the SmartCrawler web crawler and pages are indexed using Apache Lucene.

Project was awarded two Gordon I/O nodes, the accompanying compute nodes and 50 TB of space on Data Oasis. The compute nodes are used primarily for post-processing of raw data. Using the I/O nodes dramatically increased the speed of read/write file operations (10x) and I/O database operations (50x).

Source: Michael Baitaluk (UCSD) Used by permission 2013

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Structural response of bone to stress (vSMP)

The goal of the simulations is to analyze how small variances in boundary conditions affect high-strain regions in the model. The research goal is to understand the response of trabecular bone to mechanical stimuli. This has relevance for paleontologists seeking to infer the habitual locomotion of ancient people and animals, and in treatment strategies for populations with fragile bones, such as the elderly.

• 5 million quadratic, 8-noded elements
• Model created with a custom MATLAB application that converts 253 micro-CT images into voxel-based finite element models

Source: Matthew Goff, Chris Hernandez (Cornell University) Used by permission. 2012

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Managing HPC Systems at the National and Campus Levels

Rick Wagner, Ph.D. Candidate HPC Systems Manager, SDSC

SAN DIEGO SUPERCOMPUTER CENTER

Systems
• TSCC
• Gordon
• Trestles
• Data Oasis
• Dev

SAN DIEGO SUPERCOMPUTER CENTER

Challenges
Enabling the unique & differentiating features of each system…
• Gordon: Technology
• Trestles: Policy
• TSCC: Business Model
• Data Oasis: Agnosticism
• Dev: Isolation
…without maintaining N different systems.

SAN DIEGO SUPERCOMPUTER CENTER Solutions

Part I: • Common systems management – Rocks • Shared staff responsibility across systems

Part II: • Build on (cower behind) core SDSC services

SAN DIEGO SUPERCOMPUTER CENTER User Services

SAN DIEGO SUPERCOMPUTER CENTER Operations

SAN DIEGO SUPERCOMPUTER CENTER Storage

SAN DIEGO SUPERCOMPUTER CENTER VM Hosting

SAN DIEGO SUPERCOMPUTER CENTER Security

SAN DIEGO SUPERCOMPUTER CENTER Networking

SAN DIEGO SUPERCOMPUTER CENTER SDSC Sandbox

With support from:

SAN DIEGO SUPERCOMPUTER CENTER SDSC’s Myriad Areas of Expertise

Ilkay ALTINTAS, Ph.D. Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD Lab Director, Scientific Workflow Automation Technologies [email protected]

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Visualization Services Group

Lead: Amit Chourasia
• Develops new ways to represent data visually
• Research collaborations in many science and engineering disciplines
• Visualization support and consulting
• Provides visualization education and training
• Website: http://www.sdsc.edu/us/visservices/

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO Ross Walker ([email protected]) Andreas Goetz ([email protected])

Technical Expertise • GPU Computing • CUDA Teaching Center • Parallel computing • Workstation and Cluster Design for Biomolecular Simulations and Computational Drug Discovery • Cloud computing

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO Ross Walker ([email protected]) Andreas Goetz ([email protected])

Scientific Expertise • Molecular Dynamics • Quantum Chemistry • Force Field Development, Automatic Parameter Fitting • Drug Discovery • Biomolecular Simulations

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SDSC Spatial Information Systems Lab
Contact: Ilya Zaslavsky ([email protected])
• Services-based spatial information integration infrastructure (Hydrologic Information System, largest in the world)
• Advanced online and high performance GIS and geospatial databases
• Information interoperability in the geosciences
• Long-term spatial data preservation
• Information models and data standards (adopted by the federal government and internationally)
• Innovative user interfaces for connecting people, projects, resources…
• Large distributed data systems and catalogs (for scientific field observations from hydrology, critical zone science, and others)
(Figure labels: NSF EarthCube, CZO, Brain data integration, Mexico Health Atlas)

(Figure labels: Ecosystem Services, Katrina Dashboard portal)

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

High Performance Wireless Research and Education Network (HPWREN)
http://hpwren.ucsd.edu/

An extension of Area Situational Awareness for Public Safety Network (ASAPnet) Existing ~60 HPWREN/ASAPnet fire agency sites in June 2013 (from Google Earth KML object)

Project partners include: • the County of San Diego • the California Department of Forestry and Fire Protection (CAL FIRE) • the United States Forest Service (USFS) • San Diego Gas and Electric (SDG&E) • San Diego State University (SDSU)

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Mathematical anthropology

The identification of cohesive subgroups in large networks is of key importance to problems spanning the social and biological sciences. A k-cohesive subgroup has the property that it cannot be disconnected by the removal of fewer than k of its nodes. By Menger's theorem, this is equivalent to a set of vertices in which every pair is joined by at least k vertex-independent paths.

Doug White (UCI) and his collaborators are using software developed in R with the igraph package to study social networks.

The software was parallelized using the R multicore package and ported to Gordon's vSMP nodes by SDSC computational scientist Robert Sinkovits. Analyses of large problems (a 2,400-node Watts-Strogatz model) are achieving estimated speedups of 243x on 256 compute cores. Work is underway to identify cohesive subgroups in large co-authorship networks.
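The k-cohesion definition can be checked by brute force on tiny graphs. This Python sketch is illustrative only; the actual analyses used R/igraph and scale far beyond exhaustive node removal:

```python
from itertools import combinations

def connected(nodes, edges):
    """Depth-first search connectivity test on the induced subgraph."""
    nodes = set(nodes)
    if len(nodes) <= 1:
        return True
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for m in adj[stack.pop()]:
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen == nodes

def is_k_cohesive(nodes, edges, k):
    """Brute force: the subgroup survives removal of any k-1 nodes."""
    return all(connected(set(nodes) - set(cut), edges)
               for r in range(k)
               for cut in combinations(nodes, r))

# A 4-cycle is 2-cohesive (two vertex-independent paths between any
# pair of nodes) but not 3-cohesive.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(is_k_cohesive(range(4), square, 2),
      is_k_cohesive(range(4), square, 3))  # True False
```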

James Moody and Douglas R. White. Structural Cohesion and Embeddedness: A Hierarchical Concept of Social Groups. American Sociological Review 68(1):103-127, 2003.

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Impact of high-frequency trading

To determine the impact of high-frequency trading activity on financial markets, it is necessary to construct nanosecond-resolution limit order books: records of all unexecuted orders to buy or sell a stock at a specified price. The analysis provides evidence of quote stuffing, a manipulative practice that involves submitting a large number of orders with immediate cancellation in order to generate congestion.
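A toy sketch of the bookkeeping a limit order book involves (the class and its fields are hypothetical; real nanosecond-resolution LOB construction also handles executions, modifications, and vastly larger volumes):

```python
from collections import defaultdict

class LimitOrderBook:
    """Minimal LOB: tracks unexecuted orders per price level.
    Quote stuffing shows up as bursts of add() immediately followed
    by cancel() with no trades in between."""
    def __init__(self):
        self.orders = {}                  # order_id -> (side, price, qty)
        self.depth = defaultdict(int)     # (side, price) -> resting qty

    def add(self, order_id, side, price, qty):
        self.orders[order_id] = (side, price, qty)
        self.depth[(side, price)] += qty

    def cancel(self, order_id):
        side, price, qty = self.orders.pop(order_id)
        self.depth[(side, price)] -= qty

    def best_bid(self):
        bids = [p for (s, p), q in self.depth.items() if s == "B" and q > 0]
        return max(bids) if bids else None

book = LimitOrderBook()
book.add(1, "B", 99.5, 100)
book.add(2, "B", 99.7, 200)   # new best bid appears...
book.cancel(2)                 # ...and is cancelled immediately
print(book.best_bid())         # 99.5
```

Replaying hundreds of millions of such events per symbol per day is what makes the problem both memory- and compute-intensive.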

Run times for LOB construction of heavily traded NASDAQ securities (June 4, 2010)

Symbol   Wall time, orig. code (s)   Wall time, opt. code (s)   Speedup
SWN      8,400                       128                        66x
AMZN     55,200                      437                        126x
AAPL     129,914                     1,145                      113x

Optimizations by SDSC computational scientists Robert Sinkovits and DongJu Choi to the original thread-parallel code resulted in greater than 100x speedups. It is now possible to analyze an entire day of NASDAQ activity in a few hours using 16 Gordon nodes. With these new capabilities, the team is beginning to consider analysis of options data, which has 100x greater memory requirements.

Source: Mao Ye, Dept. of Finance, U. Illinois. Used by permission. 6/1/2012

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SDSC's Education, Outreach and Training (EOT) Programs

Diane Baxter, Ph.D., Ange Mason, Jeff Sale San Diego Supercomputer Center University of California, San Diego SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SDSC EOT Program Challenges
• Prepare teachers to teach their students the skills and knowledge for a future in which technology = power and computational skills = success
• Give students access to the computational tools, knowledge, and thinking skills to seek their dreams and create their future
• Train researchers at all levels to use HPC and data-intensive computing tools to accelerate discovery in science, engineering, technology, mathematics, and other data-related fields

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO Thanks! & Questions…

Ilkay Altintas [email protected]

SAN DIEGO SUPERCOMPUTER CENTER

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Industrial Engagement at SDSC

• Industrial Partners Program (IPP)
  • "Gateway" program
  • Annual membership
  • Large company, small company, and individual categories
• CLDS: focus on Big Data
• PACE: focus on Predictive Analytics
• Research Contracts: specific defined projects
• Service Agreements: for use of SDSC resources/services

SAN DIEGO SUPERCOMPUTER CENTER

THANK YOU!

SAN DIEGO SUPERCOMPUTER CENTER