Graph Streaming Processor A Next-Generation Computing Architecture Val G. Cook – Chief Software Architect Satyaki Koneru – Chief Technology Officer Ke Yin – Chief Scientist Dinakar Munagala – Chief Executive Officer THINCI, Inc. • www.thinci.com August 2017 Introduction • THINCI, Inc. “think-eye” is 5-year-old strategic/venture-backed technology startup • Develop silicon for machine learning, computer vision and other strategic parallel workloads • Provide innovative software along with a comprehensive SDK • 69-person team (95% engineering & operations) • Key IP (patents, trade secrets) – Streaming Graph Processor – Graph Computing Compiler • Product Status – Early Access Program started Q1 2017 – First edition PCIe-based development boards will ship Q4 2017 THINCI, Inc. • www.thinci.com August 2017 Architectural Objective Exceptional efficiency via balanced application of multiple parallel execution mechanisms Levels of Parallelism Key Architectural Choices • Task Level Parallelism • Direct Graph Processing • Thread Level Parallelism • Fine-Grained Thread Scheduling • Data Level Parallelism • 2D Block Processing • Parallel Reduction Instructions • Instruction Level Parallelism • Hardware Instruction Scheduling THINCI, Inc. • www.thinci.com August 2017 Task Level Parallelism Direct Graph Processing THINCI, Inc. • www.thinci.com August 2017 Task Graphs • Formalized Task Level Parallelism – Graphs define only computational semantics – Nodes reference kernels A – Kernels are programs – Nodes bind to buffers – Buffers contain structured data B C – Data dependencies explicit • ThinCI Hardware Processes Graphs Natively – A graph is an execution primitive D E F – A program is a proper sub-set of graph G THINCI, Inc. • www.thinci.com August 2017 Graph Based Frameworks • Graph Processing or Data Flow Graphs – They are a very old concept, for example Alan Turing’s “Graph Turing Machine”. – Gaining value as a computation model, particularly in the field of machine learning. • Graph-based machine learning frameworks have proliferated in recent years. Machine Learning Frameworks TensorFlow Lasagne Chainer CNTK maxDNN MxNet Neural Designer leaf cuDNN Karas DSSTNE Caffe MatConvNet Apache Kaldi Torch BIDMach deeplearning4j SINGA Caffe2 2011 2012 2013 2014 2015 2016 2017 THINCI, Inc. • www.thinci.com August 2017 Streaming vs. Sequential Processing • Sequential Node Processing Sequential Execution Streaming Execution – Commonly used by DSPs and GPUs 0 0 – Intermediate buffers are written back and forth to memory A A – Intermediate buffers are generally 1 2 non-cacheable globally 1 2 – DRAM accesses are costly • Excessive power B C B C • Excessive latency 3 4 5 3 4 5 • Graph Streaming Processor – Intermediate buffers are small D D (~1% of the original size) – Data is more easily cached 6 6 – Benefits of significantly reduced Node A Node B Node C Node D Nodes A,B,C,D memory bandwidth 1 3 5 6 • Lower power consumption 2 4 • Higher performance time time THINCI, Inc. • www.thinci.com August 2017 Thread Level Parallelism Fine-Grained Thread Scheduling THINCI, Inc. • www.thinci.com August 2017 Fine-Grained Thread Scheduling • Thread Scheduler Controller DMA Execution Thread Quad 0 Input/ – Aware of data Command Command Scheduler Special Op Unit Output Ring Unit Ring Unit Unit Processor 0 Processor 1 Processor 2 Processor 3 Thread Thread Thread Thread State State State State dependencies Read Only Inst. Scheduler Inst. Scheduler Inst. Scheduler Inst. Scheduler Cache Array Transfer SPU MPU SPU MPU SPU MPU SPU MPU (DMA) Read Write Read Write Unit Arbiter Cache – Dispatches threads when: Cache Array • Resources available Quad 1 Special Op Unit L2 Cache Processor 0 Processor 1 Processor 2 Processor 3 Thread Thread Thread Thread • Dependencies satisfied State State State State Instruction Inst. Scheduler Inst. Scheduler Inst. Scheduler Inst. Scheduler Unit SPU MPU SPU MPU SPU MPU SPU MPU Read Only – Maintains ordered AXI Bus Matrix Cache Arbiter L3 Cache State behavior as needed Unit Quad 2 Special Op Unit Read Only Cache Processor 0 Processor 1 Processor 2 Processor 3 – Thread Thread Thread Thread Prevents dead-lock State State State State Inst. Scheduler Inst. Scheduler Inst. Scheduler Inst. Scheduler SPU MPU SPU MPU SPU MPU SPU MPU • Supports Complex Scenarios Arbiter – Aggregates Threads Quad N Special Op Unit – Fractures Threads Processor 0 Processor 1 Processor 2 Processor 3 Thread Thread Thread Thread State State State State Inst. Scheduler Inst. Scheduler Inst. Scheduler Inst. Scheduler SPU MPU SPU MPU SPU MPU SPU MPU Arbiter THINCI, Inc. • www.thinci.com August 2017 Graph Execution Trace • Threads can execute from all nodes of the graph simultaneously • True hardware managed streaming behavior Graph Execution Trace Thread life-span Thread Count/Node Thread time THINCI, Inc. • www.thinci.com August 2017 Data Level Parallelism 2D Block Processing Parallel Reduction Instructions THINCI, Inc. • www.thinci.com August 2017 2D Block Processing/Reduction Instructions • Persistent data structures are accessed in blocks • Arbitrary alignment support 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 • Provides for “in-place compute” 1 src • Parallel reduction instructions 2 3 support efficient processing 4 – Reduced power 5 src – Greater throughput 6 7 – Reduced bandwidth 8 9 • Experience better scaling across dst 10 data types vs. the 2x scaling of 11 traditional vector pipelines THINCI, Inc. • www.thinci.com August 2017 Instruction Level Parallelism Hardware Instruction Scheduling THINCI, Inc. • www.thinci.com August 2017 Hardware Instruction Scheduling • Scheduling Groups of Four Processors – Hardware Instruction Picker Thread State Register Files Vector Pipeline – Selects from 100’s of threads Scalar Pipeline – Targets 10’s of independent pipelines Custom Arithmetic Instruction Scheduler Flow Control Instruction Decode Memory Ops. Move Pipeline Thread Spawn State Mgmt. THINCI, Inc. • www.thinci.com August 2017 Programming Model THINCI, Inc. • www.thinci.com August 2017 Programming Model • Fully Programmable – No a-priori constraints regarding data types, precision or graph topologies – Fully pipelined concurrent graph execution – Comprehensive SDK with support for all abstraction levels, assembly to frameworks • Machine Learning Frameworks – TensorFlow – Caffe – Torch • OpenVX + OpenCL C/C++ Language Kernels (Seeking Khronos conformance post Si) – Provides rich graph creation and execution semantics – Extended with fully accelerated custom kernel support THINCI, Inc. • www.thinci.com August 2017 Results • Arithmetic Pipeline Utilization – 95% for CNN’s (VGG16, 8-bit) • Physical Characteristics – TSMC 28nm HPC+ – Standalone SoC Mode – PCIe Accelerator Mode – SoC Power Estimate: 2.5W THINCI, Inc. • www.thinci.com August 2017.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages17 Page
-
File Size-