IBM Infosphere Streams: Assembling Continuous Insight in the Information Revolution
Total Page:16
File Type:pdf, Size:1020Kb
Front cover IBM® Information Management Software IBM InfoSphere Streams Assembling Continuous Insight in the Information Revolution Supporting scalability and dynamic adaptability Performing real-time analytics on Big Data Enabling continuous analysis of data Chuck Ballard Kevin Foster Andy Frenkiel Senthil Nathan Bugra Gedik Deepak Rajan Roger Rea Brian Williams Mike Spicer Michael P. Koranda Vitali N. Zoubov ibm.com/redbooks International Technical Support Organization IBM InfoSphere Streams: Assembling Continuous Insight in the Information Revolution October 2011 SG24-7970-00 Note: Before using this information and the product it supports, read the information in “Notices” on page ix. First Edition (October 2011) This edition applies to Version 2.0.0 of InfoSphere Streams (Product Number 5724-Y95). © Copyright International Business Machines Corporation 2011. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents Notices . ix Trademarks . x Preface . xi The team who wrote this book . xii Now you can become a published author, too! . xvii Comments welcome. xvii Stay connected to IBM Redbooks . xviii Chapter 1. Introduction. 1 1.1 Stream computing . 2 1.1.1 Business landscape . 6 1.1.2 Information environment . 9 1.1.3 The evolution of analytics . 14 1.1.4 Relationship to Big Data . 17 1.2 IBM InfoSphere Streams . 17 1.2.1 Overview of Streams. 19 1.2.2 Why use Streams . 24 1.2.3 Examples of Streams implementations. 27 Chapter 2. Streams concepts and terms. 33 2.1 IBM InfoSphere Streams: Solving new problems . 34 2.2 Concepts and terms . 39 2.2.1 Streams instances, hosts, host types, and admin services. 40 2.2.2 Projects, applications, streams, and operators . 44 2.2.3 Applications, Jobs, Processing Elements, and Containers . 47 2.3 End-to-end example: Streams and the lost child. 48 2.3.1 The Lost Child application. 50 2.3.2 Example Streams Processing Language code review . 53 2.4 IBM InfoSphere Streams tools . 58 2.4.1 Creating an example application inside Streams Studio. 59 Chapter 3. Streams applications . 67 3.1 Streams application design . 69 3.1.1 Design aspects . 70 3.1.2 Data sources . 72 3.1.3 Output . 75 3.1.4 Existing analytics. 78 3.1.5 Performance requirements . 79 © Copyright IBM Corp. 2011. All rights reserved. iii 3.2 SPL design patterns . 80 3.2.1 Reducing data using a simple filter . 82 3.2.2 Reducing data using a simple schema mapper . 86 3.2.3 Data Parallel . 90 3.2.4 Data Pipeline. 97 3.2.5 Outlier detection . 101 3.2.6 Enriching streams from a database . 107 3.2.7 Tuple Pacer. 111 3.2.8 Processing multiplexed streams . 116 3.2.9 Updating import stream subscriptions. 120 3.2.10 Updating export stream properties . 127 3.2.11 Adapting to observations of runtime metrics . 134 Chapter 4. InfoSphere Streams deployment. 139 4.1 Architecture, instances, and topologies. 140 4.1.1 Runtime architecture . 140 4.1.2 Streams instances. 143 4.1.3 Deployment topologies . 147 4.2 Streams runtime deployment planning . 151 4.2.1 Streams environment . 151 4.2.2 Sizing the environment . 151 4.2.3 Deployment and installation checklists . 152 4.3 Pre- and post-installation of the Streams environment . 160 4.3.1 Installation and configuration of the Linux environment . 160 4.3.2 InfoSphere Streams installation . 163 4.3.3 Post-installation configuration . 164 4.3.4 Streams user configuration . 168 4.4 Streams instance creation and configuration . 170 4.4.1 Streams shared instance configuration. 170 4.4.2 Streams private developer instance configuration . 175 4.5 Application deployment capabilities . 179 4.5.1 Dynamic application composition . 181 4.5.2 Operator host placement. 184 4.5.3 Operator partitioning . 189 4.5.4 Parallelizing operators. 190 4.6 Failover, availability, and recovery . 194 4.6.1 Restarting and relocating processing elements . 194 4.6.2 Recovering application hosts . 197 4.6.3 Recovering management hosts . 199 Chapter 5. Streams Processing Language . 203 5.1 Language elements. 204 5.1.1 Structure of an SPL program file. 206 iv IBM InfoSphere Streams: Assembling Continuous Insight in the Information Revolution 5.1.2 Streams data types . 209 5.1.3 Stream schemas . 210 5.1.4 Streams punctuation markers . 212 5.1.5 Streams windows . 212 5.1.6 Stream bundles . 220 5.2 Streams Processing Language operators . 220 5.2.1 Operator usage language syntax . 225 5.2.2 The Source operator . 227 5.2.3 The Sink operator . 231 5.2.4 The Functor operator . 232 5.2.5 The Aggregate operator . 234 5.2.6 The Split operator . 236 5.2.7 The Punctor operator . 240 5.2.8 The Delay operator . 241 5.2.9 The Barrier operator . 241 5.2.10 The Join operator . 242 5.2.11 The Sort operator . 245 5.2.12 Additional SPL operators . 247 5.3 The preprocessor ..