Talend Open Studio for Big Data User Guide
Talend Open Studio for Big Data User Guide 6.0.0

Talend Open Studio for Big Data

Adapted for v6.0.0. Supersedes previous releases.

Publication date: July 2, 2015

Copyleft

This documentation is provided under the terms of the Creative Commons Public License (CCPL). For more information about what you can and cannot do with this documentation in accordance with the CCPL, please read: http://creativecommons.org/licenses/by-nc-sa/2.0/

Notices

Talend is a trademark of Talend, Inc. All brands, product names, company names, trademarks and service marks are the properties of their respective owners.

License Agreement

The software described in this documentation is licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.html. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This product includes software developed at AOP Alliance (Java/J2EE AOP standards), ASM, Amazon, ANTLR, Apache ActiveMQ, Apache Ant, Apache Avro, Apache Axiom, Apache Axis, Apache Axis 2, Apache Batik, Apache CXF, Apache Cassandra, Apache Chemistry, Apache Common Http Client, Apache Common Http Core, Apache Commons, Apache Commons Bcel, Apache Commons JxPath, Apache Commons Lang, Apache Datafu, Apache Derby Database Engine and Embedded JDBC Driver, Apache Geronimo, Apache HCatalog, Apache Hadoop, Apache HBase, Apache Hive, Apache HttpClient, Apache HttpComponents Client, Apache JAMES, Apache Log4j, Apache Lucene Core, Apache Neethi, Apache Oozie, Apache POI, Apache Parquet, Apache Pig, Apache PiggyBank, Apache ServiceMix, Apache Sqoop, Apache Thrift, Apache Tomcat, Apache Velocity, Apache WSS4J, Apache WebServices Common Utilities, Apache Xml-RPC, Apache ZooKeeper, Box Java SDK (V2), CSV Tools, Cloudera HTrace, ConcurrentLinkedHashMap for Java, Couchbase Client, DataNucleus, DataStax Java Driver for Apache Cassandra, Ehcache, Ezmorph, Ganymed SSH-2 for Java, Google APIs Client Library for Java, Google Gson, Groovy, Guava: Google Core Libraries for Java, H2 Embedded Database and JDBC Driver, Hector: A high level Java client for Apache Cassandra, Hibernate BeanValidation API, Hibernate Validator, HighScale Lib, HsqlDB, Ini4j, JClouds, JDO-API, JLine, JSON, JSR 305: Annotations for Software Defect Detection in Java, JUnit, Jackson Java JSON-processor, Java API for RESTful Services, Java Agent for Memory Measurements, Jaxb, Jaxen, JetS3T, Jettison, Jetty, Joda-Time, Json Simple, LZ4: Extremely Fast Compression algorithm, LightCouch, MetaStuff, Metrics API, Metrics Reporter Config, Microsoft Azure SDK for Java, Mondrian, MongoDB Java Driver, Netty, Ning Compression codec for LZF encoding, OpenSAML, Paraccel JDBC Driver, Parboiled, PostgreSQL JDBC Driver, Protocol Buffers - Google's data interchange format, Resty: A simple HTTP REST client for Java, Riak Client, Rocoto,
SDSU Java Library, SLF4J: Simple Logging Facade for Java, SQLite JDBC Driver, Scala Lang, Simple API for CSS, Snappy for Java a fast compressor/decompressor, SpyMemCached, SshJ, StAX API, StAXON - JSON via StAX, Super CSV, The Castor Project, The Legion of the Bouncy Castle, Twitter4J, Uuid, W3C, Windows Azure Storage libraries for Java, Woden, Woodstox: High-performance XML processor, Xalan-J, Xerces2, XmlBeans, XmlSchema Core, Xmlsec - Apache Santuario, YAML parser and emitter for Java, Zip4J, atinject, dropbox-sdk-java: Java library for the Dropbox Core API, google-guice. Licensed under their respective license.

Table of Contents

Preface
  1. General information
    1.1. Purpose
    1.2. Audience
    1.3. Typographical conventions
  2. Feedback and Support
Chapter 1. Data Integration: Concepts and Principles
  1.1. Data analytics
  1.2. Operational integration
  1.3. Important terms in Talend Studio
Chapter 2. Introduction to Talend Big Data solutions
  2.1. Hadoop and Talend studio
  2.2. Functional architecture of Talend Big Data solutions
Chapter 3. Designing a Business Model
  3.1. What is a Business Model
  3.2. Opening or creating a Business Model
    3.2.1. How to open a Business Model
    3.2.2. How to create a Business Model
  3.3. Modeling a Business Model
    3.3.1. Shapes
    3.3.2. Connecting shapes
    3.3.3. How to comment and arrange a model
    3.3.4. Business Models
  3.4. Assigning repository elements to a Business Model
  3.5. Editing a Business Model
  3.6. Saving a Business Model
Chapter 4. Designing a Job
  4.1. What is a Job design
  4.2. Getting started with a basic Job
    4.2.1. Creating a Job
    4.2.2. Adding components to the Job
    4.2.3. Connecting the components together
    4.2.4. Configuring the components
    4.2.5. Executing the Job
  4.3. Using connections
    4.3.1. Connection types
    4.3.2. How to drop components in the middle of a Row link
    4.3.3. How to define connection settings
  4.4. Using contexts and variables
    4.4.1. How to define context variables for a Job
    4.4.2. How to centralize context variables in the Repository
    4.4.3. How to apply context variables to a Job
    4.4.4. How to use variables in a Job
    4.4.5. How to run a Job in a selected context
    4.4.6. StoreSQLQuery
  4.5. Using parallelization to optimize Job performance
    4.5.1. How to execute multiple Subjobs
    4.6.1. How to map data flows
    4.6.2. How to create queries using the SQLBuilder
    4.6.3. How to download/upload Talend Community components
    4.6.4. How to use the tPrejob and tPostjob components
    4.6.5. How to use the Use Output Stream feature
  4.7. Handling Jobs: miscellaneous subjects
    4.7.1. How to use folders
    4.7.2. How to define component properties
    4.7.3. How to share a database connection
    4.7.4. How to define the Start component
    4.7.5. How to handle error icons on components or Jobs
    4.7.6. How to add notes to a Job design
    4.7.7. How to display the code or the outline of your Job
    4.7.8. How to manage the subjob display
    4.7.9. How to define options on the Job view
    4.7.10. How to find components in Jobs
    4.7.11. How to set default values in the schema of a component
Chapter 5. Managing Jobs
  5.1. Activating/Deactivating a component or a subjob
    5.1.1. Activate or deactivate a component
    5.1.2. Activate or deactivate a subjob
    5.1.3. Activate or deactivate all linked subjobs
  5.2. Importing/exporting items and building Jobs
    5.2.1. How to import items
    5.2.2. How to build Jobs
    5.2.3. How to export items
    5.2.4. How to change context parameters in Jobs
  5.3. Managing repository items
    5.3.1. How to handle updates in repository items
  5.4. Searching a Job in the repository
  5.5. Managing Job versions
  5.6. Documenting a Job
    5.6.1. How to generate HTML documentation
    5.6.2. How to update the documentation on the spot
  5.7. Handling Job execution
    5.7.1. How to run a Job in normal mode
    5.7.2. How to run a Job in Java Debug mode
    5.7.3. How to run a Job in Traces Debug mode
    5.7.4. How to set advanced execution settings
    5.7.5. How to deploy a Job on SpagoBI server