Thrift Tutorial Release 1.0

Total Page:16

File Type:pdf, Size:1020Kb

Thrift Tutorial Release 1.0 Thrift Tutorial Release 1.0 Stratos Dimopoulos Sep 21, 2017 Contents 1 Introduction to Thrift 3 2 Thrift installation 5 3 Thrift protocol stack 9 4 Thrift type system 13 5 Writing a .thrift file 15 6 Thrift by Example 19 7 Indices and tables 51 i ii Thrift Tutorial, Release 1.0 The purpose of this document is to describe step by step Thrift’s installation and give simple examples of its usage. Contents 1 Thrift Tutorial, Release 1.0 2 Contents CHAPTER 1 Introduction to Thrift Thrift is a lightweight, language-independent software stack with an associated code gen- eration mechanism for RPC. Thrift provides clean abstractions for data transport, data se- rialization, and application level processing. Thrift was originally developed by Facebook and now it is open sourced as an Apache project. Apache Thrift is a set of code- generation tools that allows developers to build RPC clients and servers by just defining the data types and service interfaces in a sim- ple definition file. Given this file as an input, code is generated to build RPC clients and servers that communicate seamlessly across programming languages. In this tutorial I will describe how Thrift works and provide a guide for build and in- stallation steps, how to write thrift files and how to generate from those files the source code that can be used from different client libraries to communicate with the server. Thrift supports a variety of languages includ- ing C++, Java, Python, PHP, Ruby but for simplicity I will focus this tutorial on exam- ples that include Java and Python. 3 Fig. 1.1: Thrift Architecture. Image from wikipedia.org Thrift Tutorial, Release 1.0 Fig. 1.2: Services using Thrift 4 Chapter 1. Introduction to Thrift CHAPTER 2 Thrift installation Detailed information on how to install Thrift can be found here: http://thrift.apache.org/docs/install/ On Ubuntu Linux for example you just need to first install the dependencies and then you are ready to install Thrift. 1. Install the languages with which you plan to use thrift. To use with Java for example, install a Java JDK you prefer. In this demo I am using Oracle JDK 7 for Ubuntu, but you shouldn’t have problem using the one you like. • To use with Java you will also need to install Apache Ant sudo apt-get install ant 2. Installing required tools and libraries: sudo apt-get install libboost-dev libboost-test-dev libboost-program- ,!options-dev libboost-filesystem-dev libboost-thread-dev libevent-dev ,!automake libtool flex bison pkg-config g++ libssl-dev • You can check for specific requirements for each language you wish to use here: http://thrift.apache. org/docs/install/ 1. Download Thrift: http://thrift.apache.org/download 2. Copy the downloaded file into the desired directory and untar the file tar-xvf thrift-0.9.3.tar.gz 3. For detailed instuctions on how to build Apache Thrift on your specific system you can read here: http://thrift.apache.org/docs/BuildingFromSource • For an Ubuntu linux distribution you just need to go to the thrift directory and type: ./bootstrap.sh ./configure • At the end of the output you should be able to see a list of all the libraries that are currently built in your system and ready to use with your desired programming languages. If a component is missing you should download the missing language and repeat the above step. 5 Thrift Tutorial, Release 1.0 thrift 0.9.3 Building C++ Library......... : yes Building C (GLib) Library.... : no Building Java Library........ : yes Building C# Library .......... : no Building Python Library...... : yes Building Ruby Library........ : no Building Haskell Library..... : no Building Perl Library........ : no Building PHP Library......... : no Building Erlang Library...... : no Building Go Library.......... : no Building D Library........... : no C++ Library: Build TZlibTransport...... : yes Build TNonblockingServer.. : yes Build TQTcpServer (Qt).... : no Java Library: Using javac............... : javac Using java................ : java Using ant.................:/usr/bin/ant Python Library: Using Python..............:/usr/bin/python • Here http://thrift.apache.org/docs/install/debian/ you can find all the packages you might need to support your desired language in case some of them are missing. • On the same directory run make to build Thrift sudo make • (Optional) Run the test suite if you want sudo make check • And finally you are ready to install Thrift by running sudo make install Verify installation Now your Thrift installation is completed! To verify that you have succesfully installed Thrift just type thrift-version and you should be able to see something like the following: Thrift version 0.9.0 Additionaly you can go to the tutorial directory and follow the instructions located on the README files on each of the targeted languages directories. For example for JAVA you should be able to verify the following: 6 Chapter 2. Thrift installation Thrift Tutorial, Release 1.0 thrift/tutorial/java$ file ../../lib/java/build/libthrift-${version}-${release}.jar ../../lib/java/libthrift.jar: Zip archive data, at least v1.0 to extract thrift/tutorial/java$ file ../../compiler/cpp/thrift ../../compiler/cpp/thrift: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), ,!dynamically linked (uses shared libs), for GNU/Linux 2.6.15, not stripped thrift/tutorial/java$ ls ../../lib/java/build/lib/ commons-lang-2.5.jar junit-4.4.jar servlet-api-2.5.jar slf4j-api-1.5.8.jar slf4j- ,!simple-1.5.8.jar 2.1. Verify installation 7 Thrift Tutorial, Release 1.0 8 Chapter 2. Thrift installation CHAPTER 3 Thrift protocol stack To understand Thrift’s protocol stack I highly recommend to take a look at the white-paper: http://thrift.apache.org/static/files/thrift-20070401.pdf and to the Thrift Architecture part of this very nice description here: http://jnb.ociweb.com/jnb/jnbJun2009.html - Part of the following con- tent is directly derived from these sources. I will now briefly present Thrift’s architecture. Runtime Library The protocol and transport layer are part of the runtime library. This means that it is pos- sible to define a service and change the pro- tocol and transport without recompiling the code. Protocol Layer The protocol layer provides serialization and deserialization. Thrift supports the following protocols : • TBinaryProtocol - A straight-forward binary format encoding numeric val- ues as binary, rather than converting to text. • TCompactProtocol - Very efficient, dense encoding of data (See details be- low). • TDenseProtocol - Similar to TCom- pactProtocol but strips off the meta 9 Fig. 3.1: Thrift Architecture. Image from wikipedia.org Thrift Tutorial, Release 1.0 information from what is transmitted, and adds it back in at the receiver. TDenseProtocol is still experimental and not yet available in the Java im- plementation. • TJSONProtocol - Uses JSON for en- coding of data. • TSimpleJSONProtocol - A write-only protocol using JSON. Suitable for parsing by scripting languages • TDebugProtocol - Uses a human- readable text format to aid in debug- ging. Tranport Layer The transport layer is responsible for reading from and writing to the wire. Thrift supports the following: • TSocket - Uses blocking socket I/O for transport. • TFramedTransport - Sends data in frames, where each frame is preceded by a length. This transport is required when using a non-blocking server. • TFileTransport - This transport writes to a file. While this transport is not in- cluded with the Java implementation, it should be simple enough to imple- ment. • TMemoryTransport - Uses memory for I/O. The Java implementation uses a simple ByteArrayOutputStream in- ternally. • TZlibTransport - Performs compres- sion using zlib. Used in conjunction with another transport. Not available in the Java implementation. Processor The processor takes as arguments an input and an output protocol. Reads data from the input, processes the data throught the Handler specified by the user and then writes the data to the output. 10 Chapter 3. Thrift protocol stack Thrift Tutorial, Release 1.0 Supported Servers A server will be listening for connections to a port and will send the data it receives to the Processor to handle. • TSimpleServer - A single-threaded server using std blocking io. Useful for testing. • TThreadPoolServer - A multi-threaded server using std blocking io. • TNonblockingServer - A multi-threaded server using non-blocking io (Java implementation uses NIO channels). TFramedTransport must be used with this server. 3.3. Supported Servers 11 Thrift Tutorial, Release 1.0 12 Chapter 3. Thrift protocol stack CHAPTER 4 Thrift type system The thrift type system includes base types like bool, byte, double, string and integer but also special types like binary and it also supports structs (equivalent to classes but without inheritance) and also containers (list, set, map) that correspond to commonly available interfaces in most programming languages. The type system focuses on key types available in all programming languages and omits types that are specific to only some programming languages. • A detailed reference to Thrift type system follows next and more details can be found here: http://thrift.apache. org/docs/types • If you want to check the Thrift interface description language that allows for the definition of Thrift types you can read here: http://thrift.apache.org/docs/idl Base types • bool: A boolean value (true or false) • byte: An 8-bit signed integer • i16: A 16-bit signed integer • i32: A 32-bit signed integer • i64: A 64-bit signed integer • double: A 64-bit floating point number • string: A text string encoded using UTF-8 encoding Note: There is no support for unsigned integer types, due to the fact that there are no native unsigned integer types in many programming languages. Signed integers can be safely cast to their unsigned counterparts when necessary. Special Types binary: a sequence of unencoded bytes 13 Thrift Tutorial, Release 1.0 Structs A struct has a set of strongly typed fields, each with a unique name identifier.
Recommended publications
  • Deserialization Vulnerability by Abdelazim Mohammed(@Intx0x80)
    Deserialization vulnerability By Abdelazim Mohammed(@intx0x80) Thanks to: Mazin Ahmed (@mazen160) Asim Jaweesh(@Jaw33sh) 1 | P a g e Table of Contents Serialization (marshaling): ............................................................................................................................ 4 Deserialization (unmarshaling): .................................................................................................................... 4 Programming language support serialization: ............................................................................................... 4 Risk for using serialization: .......................................................................................................................... 5 Serialization in Java ...................................................................................................................................... 6 Deserialization vulnerability in Java: ............................................................................................................ 6 Code flow work........................................................................................................................................... 11 Vulnerability Detection: .............................................................................................................................. 12 CVE: ........................................................................................................................................................... 17 Tools: .........................................................................................................................................................
    [Show full text]
  • Unravel Data Systems Version 4.5
    UNRAVEL DATA SYSTEMS VERSION 4.5 Component name Component version name License names jQuery 1.8.2 MIT License Apache Tomcat 5.5.23 Apache License 2.0 Tachyon Project POM 0.8.2 Apache License 2.0 Apache Directory LDAP API Model 1.0.0-M20 Apache License 2.0 apache/incubator-heron 0.16.5.1 Apache License 2.0 Maven Plugin API 3.0.4 Apache License 2.0 ApacheDS Authentication Interceptor 2.0.0-M15 Apache License 2.0 Apache Directory LDAP API Extras ACI 1.0.0-M20 Apache License 2.0 Apache HttpComponents Core 4.3.3 Apache License 2.0 Spark Project Tags 2.0.0-preview Apache License 2.0 Curator Testing 3.3.0 Apache License 2.0 Apache HttpComponents Core 4.4.5 Apache License 2.0 Apache Commons Daemon 1.0.15 Apache License 2.0 classworlds 2.4 Apache License 2.0 abego TreeLayout Core 1.0.1 BSD 3-clause "New" or "Revised" License jackson-core 2.8.6 Apache License 2.0 Lucene Join 6.6.1 Apache License 2.0 Apache Commons CLI 1.3-cloudera-pre-r1439998 Apache License 2.0 hive-apache 0.5 Apache License 2.0 scala-parser-combinators 1.0.4 BSD 3-clause "New" or "Revised" License com.springsource.javax.xml.bind 2.1.7 Common Development and Distribution License 1.0 SnakeYAML 1.15 Apache License 2.0 JUnit 4.12 Common Public License 1.0 ApacheDS Protocol Kerberos 2.0.0-M12 Apache License 2.0 Apache Groovy 2.4.6 Apache License 2.0 JGraphT - Core 1.2.0 (GNU Lesser General Public License v2.1 or later AND Eclipse Public License 1.0) chill-java 0.5.0 Apache License 2.0 Apache Commons Logging 1.2 Apache License 2.0 OpenCensus 0.12.3 Apache License 2.0 ApacheDS Protocol
    [Show full text]
  • Serializing C Intermediate Representations for Efficient And
    Serializing C intermediate representations for efficient and portable parsing Jeffrey A. Meister1, Jeffrey S. Foster2,∗, and Michael Hicks2 1 Department of Computer Science and Engineering, University of California, San Diego, CA 92093, USA 2 Department of Computer Science, University of Maryland, College Park, MD 20742, USA SUMMARY C static analysis tools often use intermediate representations (IRs) that organize program data in a simple, well-structured manner. However, the C parsers that create IRs are slow, and because they are difficult to write, only a few implementations exist, limiting the languages in which a C static analysis can be written. To solve these problems, we investigate two language-independent, on-disk representations of C IRs: one using XML, and the other using an Internet standard binary encoding called XDR. We benchmark the parsing speeds of both options, finding the XML to be about a factor of two slower than parsing C and the XDR over six times faster. Furthermore, we show that the XML files are far too large at 19 times the size of C source code, while XDR is only 2.2 times the C size. We also demonstrate the portability of our XDR system by presenting a C source code querying tool in Ruby. Our solution and the insights we gained from building it will be useful to analysis authors and other clients of C IRs. We have made our software freely available for download at http://www.cs.umd.edu/projects/PL/scil/. key words: C, static analysis, intermediate representations, parsing, XML, XDR 1. Introduction There is significant interest in writing static analysis tools for C programs.
    [Show full text]
  • The Programmer's Guide to Apache Thrift MEAP
    MEAP Edition Manning Early Access Program The Programmer’s Guide to Apache Thrift Version 5 Copyright 2013 Manning Publications For more information on this and other Manning titles go to www.manning.com ©Manning Publications Co. We welcome reader comments about anything in the manuscript - other than typos and other simple mistakes. These will be cleaned up during production of the book by copyeditors and proofreaders. http://www.manning-sandbox.com/forum.jspa?forumID=873 Licensed to Daniel Gavrila <[email protected]> Welcome Hello and welcome to the third MEAP update for The Programmer’s Guide to Apache Thrift. This update adds Chapter 7, Designing and Serializing User Defined Types. This latest chapter is the first of the application layer chapters in Part 2. Chapters 3, 4 and 5 cover transports, error handling and protocols respectively. These chapters describe the foundational elements of Apache Thrift. Chapter 6 describes Apache Thrift IDL in depth, introducing the tools which enable us to describe data types and services in IDL. Chapters 7 through 9 bring these concepts into action, covering the three key applications areas of Apache Thrift in turn: User Defined Types (UDTs), Services and Servers. Chapter 7 introduces Apache Thrift IDL UDTs and provides insight into the critical role played by interface evolution in quality type design. Using IDL to effectively describe cross language types greatly simplifies the transmission of common data structures over messaging systems and other generic communications interfaces. Chapter 7 demonstrates the process of serializing types for use with external interfaces, disk I/O and in combination with Apache Thrift transport layer compression.
    [Show full text]
  • Implementation of a Serializer to Represent PHP Objects in the Extensible Markup Language Lidia N
    World Academy of Science, Engineering and Technology International Journal of Computer and Information Engineering Vol:13, No:7, 2019 Implementation of a Serializer to Represent PHP Objects in the Extensible Markup Language Lidia N. Hernández-Piña, Carlos R. Jaimez-González context of data storage and transmission, serialization is the Abstract—Interoperability in distributed systems is an important process of rendering an object into a state that can be saved feature that refers to the communication of two applications written persistently into a storage medium, such as a file, database, or in different programming languages. This paper presents a serializer a stream to be transmitted through the network. De- and a de-serializer of PHP objects to and from XML, which is an serialization is the opposite process, which puts the serialized independent library written in the PHP programming language. The XML generated by this serializer is independent of the programming version of the object into a live object in memory" [1]. language, and can be used by other existing Web Objects in XML This paper presents a serializer and a de-serializer of PHP (WOX) serializers and de-serializers, which allow interoperability objects to XML, called PHP Web Objects in XML with other object-oriented programming languages. (PHPWOX) [2]. The XML generated by PHPWOX is independent of the programming language, and can be used by Keywords—Interoperability, PHP object serialization, PHP to other existing serializers and de-serializers WOX [3], which XML, web objects in XML, WOX. allow interoperability between applications written in PHP and applications written in the programming languages supported I.
    [Show full text]
  • The Phpserialize Package
    The phpSerialize Package March 9, 2005 Title Serialize R to PHP associative array Version 0.8-01 Author Dieter Menne <[email protected]> Description Serializes R objects for import by PHP into an associative array. Can be used to build interactive web pages with R. Maintainer Dieter Menne<[email protected]> License GPL version 2 or newer R topics documented: phpSerialize . 1 Index 4 phpSerialize R to PHP Serialization Description Serializes R objects for PHP import into an associative array. Main use is for building web pages with R-support. Usage phpSerialize(x, file = NULL, append = FALSE, associative = "1D", simplifyMono=TRUE, phpTestCode = FALSE) phpSerializeAll(include=NULL, exclude=NULL, file = NULL, append = FALSE, associative = "1D", simplifyMono=TRUE, phpTestCode = FALSE) 1 2 phpSerialize Arguments x an object file a file name or a connection, or NULL to return the output as a string. If the connection is not open it will be opened and then closed on exit. append append or overwrite the file? associative a character string of "1D" (default), "2D" or "no". For "1D", only scalars and vectors are serialized as associative arrays which can be retrieved by name, while arrays are exported with numeric indexes. For "2D", arrays are also exported by name. For "no", objects are serialized with numerical indexes only, and factors are serialized as integers. simplifyMono if TRUE (default), unnamed vectors of length 1 are reduced to scalars, which is easier to understand in PHP. For simplyMono=FALSE, these vectors are serialized as arrays with length 1. We need another level of indexing ([1]) in PHP, but we are closer to the way R ticks.
    [Show full text]
  • File IO Serialization
    File IO serialization HOM HVD Streams and Java-I/O Streams in Java File IO serialization java.nio.file Object Streams and serialization Pieter van den Hombergh Serialisation formats Richard van den Ham WARNING Javascript Object Notation Fontys Hogeschool voor Techniek en Logistiek March 13, 2018 HOMHVD/FHTenL File IO serialization March 13, 2018 1/23 File IO Topics serialization HOM HVD Streams and Streams and Java-I/O Java-I/O Streams in Java Streams in Java java.nio.file java.nio.file Object Streams and serialization Serialisation formats WARNING Object Streams and serialization Javascript Object Notation Serialisation formats WARNING Javascript Object Notation HOMHVD/FHTenL File IO serialization March 13, 2018 2/23 File IO Streams in Java serialization HOM HVD Streams and Java-I/O Streams in Java java.nio.file Object Streams and serialization Serialisation formats WARNING Javascript Object Notation Figure: Taken from the Oracle/Sun Java tutorial Read or write information from different sources and types. sources: network, files, devices types: text, picture, sound Streams are FIFOs, uni-directional HOMHVD/FHTenL File IO serialization March 13, 2018 3/23 File IO Two basic stream types serialization HOM Byte Streams HVD java.io.InputStream, java.io.OutputStream Streams and Java-I/O read and write pdf, mp3, raw... Streams in Java java.nio.file Character Streams (16-bit Unicode) Object Streams java.io.Reader, java.io.Writer and serialization simplifies reading and writing characters, character-arrays or Strings Serialisation formats WARNING Javascript
    [Show full text]
  • HDP 3.1.4 Release Notes Date of Publish: 2019-08-26
    Release Notes 3 HDP 3.1.4 Release Notes Date of Publish: 2019-08-26 https://docs.hortonworks.com Release Notes | Contents | ii Contents HDP 3.1.4 Release Notes..........................................................................................4 Component Versions.................................................................................................4 Descriptions of New Features..................................................................................5 Deprecation Notices.................................................................................................. 6 Terminology.......................................................................................................................................................... 6 Removed Components and Product Capabilities.................................................................................................6 Testing Unsupported Features................................................................................ 6 Descriptions of the Latest Technical Preview Features.......................................................................................7 Upgrading to HDP 3.1.4...........................................................................................7 Behavioral Changes.................................................................................................. 7 Apache Patch Information.....................................................................................11 Accumulo...........................................................................................................................................................
    [Show full text]
  • Kyuubi Release 1.3.0 Kent
    Kyuubi Release 1.3.0 Kent Yao Sep 30, 2021 USAGE GUIDE 1 Multi-tenancy 3 2 Ease of Use 5 3 Run Anywhere 7 4 High Performance 9 5 Authentication & Authorization 11 6 High Availability 13 6.1 Quick Start................................................ 13 6.2 Deploying Kyuubi............................................ 47 6.3 Kyuubi Security Overview........................................ 76 6.4 Client Documentation.......................................... 80 6.5 Integrations................................................ 82 6.6 Monitoring................................................ 87 6.7 SQL References............................................. 94 6.8 Tools................................................... 98 6.9 Overview................................................. 101 6.10 Develop Tools.............................................. 113 6.11 Community................................................ 120 6.12 Appendixes................................................ 128 i ii Kyuubi, Release 1.3.0 Kyuubi™ is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark™. In general, the complete ecosystem of Kyuubi falls into the hierarchies shown in the above figure, with each layer loosely coupled to the other. For example, you can use Kyuubi, Spark and Apache Iceberg to build and manage Data Lake with pure SQL for both data processing e.g. ETL, and analytics e.g. BI. All workloads can be done on one platform, using one copy of data, with one SQL interface. Kyuubi provides the following features: USAGE GUIDE 1 Kyuubi, Release 1.3.0 2 USAGE GUIDE CHAPTER ONE MULTI-TENANCY Kyuubi supports the end-to-end multi-tenancy, and this is why we want to create this project despite that the Spark Thrift JDBC/ODBC server already exists. 1. Supports multi-client concurrency and authentication 2. Supports one Spark application per account(SPA). 3. Supports QUEUE/NAMESPACE Access Control Lists (ACL) 4.
    [Show full text]
  • Implementing Replication for Predictability Within Apache Thrift Jianwei Tu the Ohio State University [email protected]
    Implementing Replication for Predictability within Apache Thrift Jianwei Tu The Ohio State University [email protected] ABSTRACT have a large number of packets. A study indicated that about Interactive applications, such as search, social networking and 0.02% of all flows contributed more than 59.3% of the total retail, hosted in cloud data center generate large quantities of traffic volume [1]. TCP is the dominating transport protocol small workloads that require extremely low median and tail used in data center. However, the performance for short flows latency in order to provide soft real-time performance to users. in TCP is very poor: although in theory they can be finished These small workloads are known as short TCP flows. in 10-20 microseconds with 1G or 10G interconnects, the However, these short TCP flows experience long latencies actual flow completion time (FCT) is as high as tens of due in part to large workloads consuming most available milliseconds [2]. This is due in part to long flows consuming buffer in the switches. Imperfect routing algorithm such as some or all of the available buffers in the switches [3]. ECMP makes the matter even worse. We propose a transport Imperfect routing algorithms such as ECMP makes the matter mechanism using replication for predictability to achieve low even worse. State of the art forwarding in enterprise and data flow completion time (FCT) for short TCP flows. We center environment uses ECMP to statically direct flows implement replication for predictability within Apache Thrift across available paths using flow hashing. It doesn’t account transport layer that replicates each short TCP flow and sends for either current network utilization or flow size, and may out identical packets for both flows, then utilizes the first flow direct many long flows to the same path causing flash that finishes the transfer.
    [Show full text]
  • Pharmaceutical Serialization Track and Trace
    PHARMACEUTICAL SERIALIZATION TRACK & TRACE EASY GUIDE TO COUNTRY- WISE MANDATES External Document © 2018 Infosys Limited External Document © 2018 Infosys Limited Table of Contents Introduction 05 US Federal 06 California 08 Argentina 10 Brazil 12 South Korea 14 India 16 China 18 EU 20 France 22 Turkey 24 References 26 External Document © 2018 Infosys Limited External Document © 2018 Infosys Limited External Document © 2018 Infosys Limited Introduction Pharmaceutical companies have to contend with challenges stemming from supply chain security lapses (resulting in theft, diversion and product recalls), counterfeiting and stringent regulations. In addition to alarming safety concerns, these challenges also impair the health of the industry by adversely impacting profits, brand credibility and research initiatives. With both industry and governments around the world realizing the significance of implementing product serialization, it becomes mandatory for all entities within the supply chain to comply with federal and/or state legislations pertaining to the locations in which they operate. Typically, drug distribution systems consist of entities such as manufacturers, wholesale distributors and pharmacies before products reach the end consumer. Ensuring secure product track and trace capabilities across various touch points throughout the supply chain - through product serialization implementation - is crucial to address the challenges faced by the industry. Apart from providing visibility and full traceability within the supply chain, successful serialization programs will prove to be a key differentiator and a clear competitive advantage for pharmaceutical companies. This guide, compiled by the Legal and Research team at the Infosys Life Sciences Center of Excellence provides a summary of the legal and regulatory framework proposed by countries including Argentina, Brazil, China, EU, France, India, South Korea, Turkey, and USA (Federal and California) to maintain supply chain integrity and ensure patient safety.
    [Show full text]
  • Rprotobuf: Efficient Cross-Language Data Serialization in R
    RProtoBuf: Efficient Cross-Language Data Serialization in R Dirk Eddelbuettel Murray Stokely Jeroen Ooms Debian Project Google, Inc UCLA Abstract Modern data collection and analysis pipelines often involve a sophisticated mix of applications written in general purpose and specialized programming languages. Many formats commonly used to import and export data between different programs or systems, such as CSV or JSON, are verbose, inefficient, not type-safe, or tied to a specific program- ming language. Protocol Buffers are a popular method of serializing structured data between applications|while remaining independent of programming languages or oper- ating systems. They offer a unique combination of features, performance, and maturity that seems particulary well suited for data-driven applications and numerical comput- ing. The RProtoBuf package provides a complete interface to Protocol Buffers from the R environment for statistical computing. This paper outlines the general class of data serialization requirements for statistical computing, describes the implementation of the RProtoBuf package, and illustrates its use with example applications in large-scale data collection pipelines and web services. Keywords: R, Rcpp, Protocol Buffers, serialization, cross-platform. 1. Introduction Modern data collection and analysis pipelines increasingly involve collections of decoupled components in order to better manage software complexity through reusability, modularity, and fault isolation (Wegiel and Krintz 2010). These pipelines are frequently built using dif- ferent programming languages for the different phases of data analysis | collection, cleaning, modeling, analysis, post-processing, and presentation | in order to take advantage of the unique combination of performance, speed of development, and library support offered by different environments and languages. Each stage of such a data analysis pipeline may pro- arXiv:1401.7372v1 [stat.CO] 28 Jan 2014 duce intermediate results that need to be stored in a file, or sent over the network for further processing.
    [Show full text]