Fujitsu World Tour 2017 shaping tomorrow with you

Fujitsu North America Technology Forum 2017


New Computing Paradigms: Architecture Innovations beyond Moore’s Law
Takeshi Horie, Head of Computer Systems Laboratory, FUJITSU LABORATORIES LTD.


Why computing now?

 Data explosion
 Data is generated by many IoT devices, and the amount of data is exploding.
 Computing creates knowledge and intelligence from data, but traditional computing cannot handle it.

 End of Moore’s law
 For 50 years we have benefited from device technology scaling, but that scaling is ending.

Fundamentally rethink computing architecture

Demand for Computing and Fujitsu Computer Systems

Computer performance

 Since ENIAC was developed 70 years ago, computer performance has doubled roughly every 1.5 years.

[Chart: computations per second per computer, 1930-2010, log scale, doubling every 1.5 years since ENIAC (1946, U.S. federal government)]

Computing demand for scientific applications

 Although computing has enabled applications in a variety of fields, much higher computing power is still required to solve the complex problems of the real world.
 Heart simulation: joint research with the University of Tokyo
 Tsunami simulation: joint research with Tohoku University's International Research Institute of Disaster Science
 Life science and drug manufacturing

Global change prediction for reducing disaster

Industrial innovation

New material and energy creation

Origin of matter and the universe

Computing demand for financial applications

 Tokyo Stock Exchange, Inc. (TSE) is one of the world's top trading markets, listing around 3,800 issues; daily trading value exceeds three trillion yen.
 Trading volume is constantly increasing year by year.
 For high-frequency trading, response time has been reduced from 2 ms to 500 µs in five years.

[Charts: trading volume in the TSE 1st Section (millions of shares, 1949-2015) and TSE response time, reduced from 2 ms (2010) to 900 µs (2012) to 500 µs (2015)]

Fujitsu computer systems

[Timeline of Fujitsu computer systems, 1950-2010]
 Supercomputers: VP-100 (1982), VPP-500 (1992), PRIMEHPC FX10 (2011)
 Mainframes: FACOM100 (1954), FACOM230-10 (1965), M-190 (1976), M-780 (1985), M-1800 (1990), GS21 (2002)
 Enterprise servers: DS90 (1991), PRIMEQUEST (2005), SPARC M10 (2013)
 Ubiquitous devices and terminals: OASYS100 (1980), FM TOWNS (1989), FM V (1993), Arrows (2011)

Fujitsu microprocessors

[Roadmap of Fujitsu microprocessors, 1999-2016: supercomputer CPUs (SPARC64 VIIIfx, IXfx, XIfx), UNIX server CPUs (SPARC64 GP, V, V+, VI, VII, X, X+), and mainframe CPUs (GS8600, GS8800, GS8800B, GS8900, GS21 600, GS21 900, GS21 M2600). High-performance features include super-scalar and out-of-order execution, single-chip CPU, non-blocking cache, L2 cache on die, multi-core multi-thread, hardware barrier, HPC-ACE, high-speed interconnect, system on chip, software on chip, and virtual machine architecture; high-reliability features include store ahead, branch history, prefetch, cache ECC, register/ALU parity, instruction retry, dynamic cache degradation, and RC/RT/History.]

Fujitsu high performance computing

 Fujitsu provides many HPC solutions to satisfy various customer demands.
 Support for both supercomputers with Fujitsu's original CPUs and x86 cluster systems
 Post-K will be developed in collaboration with RIKEN and ARM

Original CPU

K computer (co-developed with RIKEN), PRIMEHPC FX100, Post-K (co-developed with RIKEN and ARM)

x86 Cluster

BX900 Cluster (co-developed with JAEA), Oakforest-PACS

IoT and Data Explosion

IoT connects everything

 By 2020, 50 billion devices will be connected and generate data constantly.

[Chart: billions of devices connected to the Internet vs. the worldwide population, 1990-2020. Early on, only about 1 million PCs were connected; by 2020 more than 50 billion devices are expected, exceeding the worldwide population. Source: Cisco]

Data explosion

 As the amount of data explodes, it exceeds the capability of traditional ICT.
 New processing is needed to create valuable information from unstructured data.
 1 ZB = 10^21 bytes; 1 YB = 10^24 bytes

The amount of data will reach 40 zettabytes by 2020 and 1 yottabyte by 2030.

[Chart: amount of data, 1990-2020. Structured data (business data, RDB) grows slowly; unstructured data (IoT, sensors) explodes.]

Data lifecycle and processing

 New processing throughout the data lifecycle creates knowledge and intelligence.
[Diagram of the data lifecycle: IoT devices collect and distribute data; the edge pre-processes data into information; the cloud extracts value from volumes of data; AI integration provides solutions with knowledge and intelligence.]

New computing for data explosion

 New computing extracts knowledge and intelligence from data, and enables delivery of new applications and services.

[Diagram: a hierarchy from data to information, knowledge, and intelligence. Numerical computing handles data processing; knowledge and intelligence computing extracts value from volumes of data to deliver new applications and services.]

Technology Trend for Computing

Moore’s law and microprocessor trend

 Moore’s law drove processor performance until around 2005; power consumption has limited performance since then; Moore’s law itself is expected to end around 2025.
[Chart: microprocessor performance trend (CAGR) and number of cores, 1970-2030, log scale from 10^0 to 10^9. Source: estimated based on Stanford and K. Rupp data]

Trade-off line of Moore’s law

 Device technology scaling has brought higher performance as well as higher power efficiency for the last 50 years.
 The trade-off line, power efficiency × (performance)^2 = K ∝ s^5 (s: scaling factor), is determined by the device technology of each generation. As technology scales, the trade-off line moves upward.
 Technology scaling will stop around 2025.
[Chart: power efficiency (a.u.) vs. performance (a.u.) on log-log axes, with Moore’s trade-off lines for 1990, 2000, 2010, and 2025 and points for mobile and server processors]

Technology scaling will no longer be a driver for computing

Computing innovations

 Continue to create new computing paradigms for unlimited performance growth

[Chart: performance vs. year, 2010-2030. The conventional computing paradigm levels off, while a new computing paradigm, adapted to each domain through domain-specific computing, continues to raise performance.]

Computing Architecture Innovation

Data explosion and challenges

 Overcome challenges through innovation in computing and data processing

Challenges
• Process technology
• Network bandwidth
• Power consumption
• Computing power
[Chart: amount of data, 2000-2030, with structured data growing slowly and unstructured data exploding]

Our proposal for computing architecture innovation

 Create a new computing paradigm for the data explosion
[Diagram: the amount of data explodes from 40 ZB (40 × 10^21 B) in 2020 toward 1 YB (10^24 B) in 2030 while Moore’s law ends. The challenges (process technology, network bandwidth, power consumption, and computing power, i.e. the limits of power, transmission, integration, and processing) are addressed by a new computing architecture that underpins the Hyperconnected Cloud and its systems.]

Hyperconnected Cloud

 R&D vision and strategy: “Hyperconnected Cloud”
 Web-scale ICT provides computing and data processing power through service-oriented connection
 AI and security are embedded at every layer to create knowledge in a safe and secure society

New computing architecture

 From numerical to media, knowledge, and intelligence processing

[Chart: new metrics vs. processing. Beyond the limit of Moore’s law and conventional computing (supercomputers), the landscape includes accelerators, neural computing for learning and inference, brain-inspired computing, approximate computing, and quantum computing.]

Direction of new computing architecture

Conventional → New Computing
 Many core → Extreme parallelism
 General-purpose core → Simple and specific core
 Strict accuracy → Relaxed accuracy

Domain specific computing

 Achieve extremely high performance, simple operation, and low cost by specializing hardware and software for specific application domains

[Chart: the same new-computing landscape, highlighting the domains addressed in the rest of this talk: media processing (accelerator), neural computing (learning and inference), quantum-inspired computing, and approximate computing.]

Media Processing

Needs for image retrieval

 Office workers routinely create and store numerous documents that contain images, such as presentation materials.
 These massive stores of image material are not reused sufficiently.
 10% of office work time is wasted searching for needed documents.

A more intuitive search method is needed: “search by image” increases productivity

Partial image retrieval

 Find images based on matches with a part of the query image (partial matches, including enlarged or reduced versions) against a massive image database.
 A general-purpose server takes a long time for the massive calculations that partial matching requires.
 Acceleration of partial image retrieval is required to find a target image intuitively and efficiently.

Image search acceleration system: demonstration

 We developed technology for instantaneous search of a target image from a massive volume of images

Image search acceleration system: architecture and implementation

 Designed special engines for feature extraction and matching on an FPGA (press release, Feb. 2nd, 2016)
[Block diagram of the partial image retrieval engine: the CPU server handles overall control, I/O processing, and the database, while the FPGA hosts a dedicated feature-extraction unit (F.E. 0-31, 32-way parallelization) and dedicated matching units (Match 0-5, each with 64 Hamming-distance calculators H.D. 0-63, 384-way parallelization in total), illustrating simple and specific cores, extreme parallelism, and relaxed accuracy]
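To make the matching step concrete, here is a small software sketch of Hamming-distance matching over binary image features. The 256-bit feature size, database layout, and function names are assumptions for illustration; it mirrors the idea of the many parallel H.D. units per matching engine, not the actual FPGA design.

# Illustrative sketch (assumed, not the FPGA implementation): match a query's
# binary feature against a database by Hamming distance, scoring many entries
# at once as the parallel H.D. units do in hardware.
import numpy as np

def hamming_distances(query, database):
    """Popcount of XOR between one query feature and every database feature.
    query: (n_bytes,) uint8; database: (n_images, n_bytes) uint8."""
    xor = np.bitwise_xor(database, query)
    return np.unpackbits(xor, axis=1).sum(axis=1)

def search(query, database, top_k=5):
    d = hamming_distances(query, database)
    order = np.argsort(d)[:top_k]            # smallest distance = best match
    return list(zip(order.tolist(), d[order].tolist()))

# Toy usage: 10,000 images with 256-bit (32-byte) binary features.
rng = np.random.default_rng(0)
db = rng.integers(0, 256, size=(10_000, 32), dtype=np.uint8)
noise = (rng.random(32) < 0.05).astype(np.uint8)   # flip a few bits of one entry
q = db[1234] ^ noise                               # near-duplicate query
print(search(q, db))                               # entry 1234 ranks first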

Image search acceleration system: performance and applications

 “Search by image” makes document creation more productive and can be applied to medical and weather applications

[Chart: throughput of a conventional server (200 images/sec) vs. the media domain-specific server built with an FPGA and many-core processors (12,000 images/sec); more than 50 times faster. Application examples: documents, medical, weather.]

Neural Computing

Neural computing comes back again

 Deep learning algorithms and enhanced computing capability have enabled much higher object recognition rates than ever before since 2012.
[Chart: general object recognition benchmark results, 2011-2015, showing a large difference between conventional machine learning algorithms and neural computing, improving every year; inset: a feedforward neural network with inputs, weights w_ij, and outputs y_1 ... y_n]
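As a reminder of what the feedforward network in the figure computes, here is a minimal inference sketch; the layer sizes and the ReLU/softmax choices are arbitrary assumptions for illustration, not taken from the slide.

# Minimal feedforward inference sketch: each layer multiplies by a weight
# matrix w_ij, adds a bias, and applies a nonlinearity; the final layer
# produces the class scores y_1 ... y_n.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, layers):
    """layers: list of (W, b) pairs; x: (batch, features)."""
    for W, b in layers[:-1]:
        x = relu(x @ W + b)
    W, b = layers[-1]
    return softmax(x @ W + b)

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((784, 128)) * 0.05, np.zeros(128)),
          (rng.standard_normal((128, 10)) * 0.05, np.zeros(10))]
y = forward(rng.standard_normal((8, 784)), layers)  # 8 inputs, 10 scores each
print(y.shape, y.sum(axis=1))                       # each row sums to 1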

[Figure: processing pipelines compared. Conventional machine learning: input image, manually designed feature extraction, features, classification, results. Deep learning: input image, automatically learned feature extraction, classification, results, split into learning and inference phases.]

Computing for deeper neural network

 To achieve higher accuracy, neural networks have become deeper and larger.
 Processing speed: learning with deeper neural networks is time-consuming.
 Processing capacity: the limited memory size of a GPU is critical for larger neural networks.
[Chart: neural network size trend, 1998-2016, showing the memory required at batch size 8 for LeNet, AlexNet, VGGNet, and ResNet approaching the ~16 GB memory of a GPU]

Fastest learning w/ HPC technology

 Developed high-speed technology for deep learning processing (press release, Aug. 9th, 2016)
 Using AlexNet, 64 GPUs in parallel achieve 27 times the learning speed of a single GPU, the world's fastest processing
[Chart: at the same accuracy, our approach with 64 GPUs is 1.8x faster than the conventional 64-GPU approach; compared with 1 GPU it delivers 27x faster learning speed (60x faster execution speed)]
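The slide reports the scaling result but not the algorithm, so the sketch below shows only the generic data-parallel pattern that such multi-GPU learning builds on: split a batch across workers, average their gradients, and apply the same update everywhere. It is an assumed illustration simulated with plain NumPy, not Fujitsu's method or real GPU code.

# Generic data-parallel learning step: each of n_workers computes a gradient
# on its shard of the batch, the gradients are averaged (in a real system:
# all-reduced over the interconnect), and every worker applies the same update.
import numpy as np

def gradient(w, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def data_parallel_step(w, X, y, n_workers=4, lr=0.1):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [gradient(w, Xi, yi) for Xi, yi in shards]  # run in parallel in practice
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w
w = np.zeros(5)
for _ in range(200):
    w = data_parallel_step(w, X, y)
print(np.round(w, 3))  # approaches [1, 2, 3, 4, 5]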

Doubles deep learning neural network scale

 Developed technology to streamline the internal memory usage of GPUs to support the growing neural network scale that heightens machine learning accuracy (press release, Sep. 21st, 2016)
 Enabled machine learning of neural networks up to twice the scale possible with previous technology
 Response after the press release: “How A New Technology Promises To Make Learning More Powerful Than It Already Is,” by Kevin Murnane, Forbes
[Chart: compared with the conventional approach, our approach handles 2x more images and achieves 4% more accuracy within the same GPU memory]

Deep learning processor: DLU™ (press release, Nov. 29th, 2016)

 Dedicated architecture for deep learning (extreme parallelism; simple and specific cores; relaxed accuracy)
 Supercomputer’s interconnect
 Extremely low-power design
[Block diagram of the DLU™ (Deep Learning Unit): a host interface and HBM2 memory feed an array of DPUs (DPU-0 ... DPU-n), each containing many DPEs; up to 100,000 DLUs can be connected via the Tofu interconnect]

Quantum-Mechanics-Inspired Computing

Motivation: combinatorial optimization problem

 Various combinatorial optimization problems exist in the real world: investment portfolios, disaster recovery, power delivery, and more.
 Vehicle routing problem: finding the optimal routes for delivery vehicles serving 2,000 customers involves on the order of 10^7535 possible routing patterns.
 The optimal route must be chosen from this enormous number of combinations to minimize cost, and calculation time increases exponentially with the number of customers.
 An efficient approach is needed to handle this explosion of combinations.
[Diagram: candidate routing patterns (1st, 2nd, 3rd, ...) from a depot to customer1 ... customer2000, asking which has the minimum cost]
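To see why brute-force enumeration is hopeless, the toy sketch below simply counts customer orderings as permutations (n!). This is a deliberately rough proxy, since the slide's own figure depends on the exact vehicle-routing formulation, but the exponential blow-up is the point.

# Rough illustration of the combinatorial explosion: the number of ways to
# order n customers is n!, so exhaustive enumeration fails almost immediately.
import math

for n in (5, 10, 20, 100, 2000):
    digits = len(str(math.factorial(n)))
    print(f"{n:>5} customers -> n! has {digits} digits")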

Our strategy to solve optimization problem

[Quadrant chart: processing speed (slow to fast) vs. range of applicable problems. A conventional processor is slow but applicable to practical problems such as locating power grid failures and pick-up and delivery across 2,000 depots. A quantum computer (quantum annealing type) is fast but limited to small problems such as locating failures in a 20-breaker power grid and map coloring. Our goal is to be both fast and applicable to practical problems.]

Create a high-speed and widely applicable architecture

Quantum-Mechanics-Inspired Computer

 An architecture that meets the usability and scalability requirements of combinatorial optimization (press release, Oct. 20th, 2016)
 Solve practical problems by using a CMOS digital design
 Realize scalability to larger problems and further speed enhancement
 Features:
 Simple cores reduce data movement and control overheads (simple and specific cores, relaxed accuracy).
 Massively parallel stochastic search accelerates the exploration of search paths: speed-up comes from parallel score calculation and transition facilitation, and multiple engines handle larger problems (extreme parallelism). A software sketch of this idea follows below.
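The slide does not spell out the algorithm, so the following is only a minimal software analogy of massively parallel stochastic search on an assumed QUBO formulation: every candidate bit flip is scored at once (vectorized here; dedicated parallel hardware in the real engine) and one accepted flip is applied under a gradually falling temperature. It is a sketch of the idea, not Fujitsu's implementation.

# Software analogy of parallel stochastic search for a QUBO problem:
# minimize E(x) = x^T Q x over binary vectors x.
import numpy as np

def parallel_stochastic_search(Q, steps=5000, beta0=0.1, beta1=5.0, seed=0):
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    x = rng.integers(0, 2, size=n)
    best_x, best_e = x.copy(), float(x @ Q @ x)
    diag = np.diag(Q)
    for t in range(steps):
        beta = beta0 + (beta1 - beta0) * t / steps       # annealing schedule
        s = Q @ x + Q.T @ x                               # couplings seen by each bit
        delta = (1 - 2 * x) * (diag + s - 2 * diag * x)   # energy change of every flip
        # "Parallel trial": accept each flip with its Metropolis probability,
        # then apply one accepted flip chosen at random.
        accept = rng.random(n) < np.exp(-beta * np.clip(delta, 0.0, None))
        if accept.any():
            k = rng.choice(np.flatnonzero(accept))
            x[k] ^= 1
            e = float(x @ Q @ x)
            if e < best_e:
                best_x, best_e = x.copy(), e
    return best_x, best_e

# Toy usage: a random 20-variable QUBO.
rng = np.random.default_rng(1)
Q = rng.standard_normal((20, 20))
Q = (Q + Q.T) / 2
print(parallel_stochastic_search(Q))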

Evaluation of our prototype

 Engine performance was evaluated using an FPGA implementation.
[Chart: time to solution in seconds (log scale, 0.1 to 10,000) for a 32-city traveling salesman problem: a conventional processor (3.5-GHz Intel Xeon E5) as the baseline, the FPGA implementation (2x faster), plus parallel score calculation (a further 1,000x), plus transition facilitation (a further 6x)]
 A 12,000x speedup was confirmed using the 32-city traveling salesman problem.

Demonstration

Ecosystem of combinatorial optimizer

 Collaborate with universities, research institutes and industries to apply our technologies to practical problems

[Ecosystem diagram: Fujitsu offers its enhanced combinatorial optimizer engine as a cloud service with an open framework and a software development environment; universities and research institutes contribute research, while a user community of early users builds practical applications such as delivery, AI, PoB, and CAD.]

Approximate Computing

Approximate computing

 Optimizing accuracy to the target workload enables higher performance and higher energy efficiency at the same time.
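As a generic illustration of this accuracy-for-efficiency trade (an assumed example, not a Fujitsu design), the sketch below compares a dot product computed with float64 and float16 inputs: the low-precision version stores a quarter of the data for an error that many media and learning workloads can tolerate.

# Relaxed-precision illustration: float16 inputs cut data size by 4x at the
# cost of a small, workload-dependent numerical error.
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(1000)                      # values in [0, 1)
b = rng.random(1000)

exact = float(np.dot(a, b))               # float64 reference
approx = float(np.dot(a.astype(np.float16), b.astype(np.float16)))

print(f"float64 result : {exact:.6f}")
print(f"float16 result : {approx:.6f}")
print(f"relative error : {abs(approx - exact) / exact:.2e}")
print(f"bytes per input vector: {a.nbytes} (float64) vs "
      f"{a.astype(np.float16).nbytes} (float16)")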

[Chart: the new-computing landscape once more, this time highlighting approximate computing alongside accelerators, neural computing, brain-inspired computing, and quantum computing.]

Summary

Computing innovations beyond Moore’s law

 Fujitsu will continue to innovate computing architecture

[Chart: performance vs. year, 2010-2030. The conventional computing paradigm levels off, while the new computing paradigm, built on domain-specific computing (media processing, neural computing, quantum-inspired computing, approximate computing), keeps performance growing.]
