
Faust Documentation
Release 1.7.4
Robinhood Markets, Inc.
Jul 23, 2019

۶(◕‿◕)٩

    # Python Streams
    # Forever scalable event processing & in-memory durable K/V store;
    # w/ asyncio & static typing.
    import faust

Faust is a stream processing library, porting the ideas from Kafka Streams to Python.

It is used at Robinhood to build high performance distributed systems and real-time data pipelines that process billions of events every day.

Faust provides both stream processing and event processing, sharing similarity with tools such as Kafka Streams, Apache Spark, Storm, Samza and Flink. It does not use a DSL, it's just Python! This means you can use all your favorite Python libraries when stream processing: NumPy, PyTorch, Pandas, NLTK, Django, Flask, SQLAlchemy, and more.

Faust requires Python 3.6 or later for the new async/await syntax and variable type annotations.

Here's an example processing a stream of incoming orders:

    app = faust.App('myapp', broker='kafka://localhost')

    # Models describe how messages are serialized:
    # {"account_id": "3fae-...", "amount": 3}
    class Order(faust.Record):
        account_id: str
        amount: int

    @app.agent(value_type=Order)
    async def order(orders):
        async for order in orders:
            # process infinite stream of orders.
            print(f'Order for {order.account_id}: {order.amount}')

The Agent decorator defines a "stream processor" that essentially consumes from a Kafka topic and does something for every event it receives. The agent is an async def function, so it can also perform other operations asynchronously, such as web requests.

This system can persist state, acting like a database. Tables are named distributed key/value stores you can use as regular Python dictionaries. Tables are stored locally on each machine using a super fast embedded database written in C++, called RocksDB.

Tables can also store aggregate counts that are optionally "windowed" so you can keep track of, for example, "number of clicks from the last day" or "number of clicks in the last hour." Like Kafka Streams, we support tumbling, hopping and sliding windows of time, and old windows can be expired to stop data from filling up (a sketch of a windowed table follows the click-counting example below).

For reliability we use a Kafka topic as a "write-ahead log": whenever a key is changed we publish to the changelog. Standby nodes consume from this changelog to keep an exact replica of the data, enabling instant recovery should any of the nodes fail. To the user a table is just a dictionary, but data is persisted between restarts and replicated across nodes so on failover other nodes can take over automatically.

You can count page views by URL:

    # data sent to 'clicks' topic sharded by URL key.
    # e.g. key="http://example.com" value="1"
    click_topic = app.topic('clicks', key_type=str, value_type=int)

    # default value for missing URL will be 0 with `default=int`
    counts = app.Table('click_counts', default=int)

    @app.agent(click_topic)
    async def count_click(clicks):
        async for url, count in clicks.items():
            counts[url] += count

The data sent to the Kafka topic is partitioned, which means the clicks will be sharded by URL in such a way that every count for the same URL will be delivered to the same Faust worker instance.
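The producing side of this example might look like the following minimal sketch. It reuses the app and click_topic defined above; the timer name produce_clicks and the example URL are illustrative, not part of the Faust API:

    # Hypothetical producer feeding the 'clicks' topic defined above.
    # The key determines the Kafka partition, and therefore which
    # worker instance receives the event.
    @app.timer(interval=5.0)
    async def produce_clicks(app):
        await click_topic.send(key='http://example.com', value=1)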
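The windowed variant mentioned earlier can be sketched the same way: the same counts, but bucketed into fixed one-hour tumbling windows, with windows older than one day expired to bound storage (the table name here is illustrative):

    from datetime import timedelta

    # Same counts as before, but each key now maps to per-window
    # buckets instead of a single number.
    hourly_counts = app.Table(
        'click_counts_hourly', default=int,
    ).tumbling(timedelta(hours=1), expires=timedelta(days=1))

Inside an agent a windowed table is updated exactly like the plain one (hourly_counts[url] += count), and can then be read relative to the event being processed, e.g. with hourly_counts[url].current().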
Faust supports any type of stream data: bytes, Unicode and serialized structures, but also comes with "Models" that use modern Python syntax to describe how keys and values in streams are serialized:

    # Order is a json serialized dictionary,
    # having these fields:
    class Order(faust.Record):
        account_id: str
        product_id: str
        price: float
        quantity: float = 1.0

    orders_topic = app.topic('orders', key_type=str, value_type=Order)

    @app.agent(orders_topic)
    async def process_order(orders):
        async for order in orders:
            # process each order using regular Python
            total_price = order.price * order.quantity
            await send_order_received_email(order.account_id, order)

Faust is statically typed, using the mypy type checker, so you can take advantage of static types when writing applications.

The Faust source code is small, well organized, and serves as a good resource for learning the implementation of Kafka Streams.

Read the Introducing Faust page to learn more about Faust, system requirements, installation instructions, community resources, and more, or go directly to the Quick Start tutorial to see Faust in action by programming a streaming application. Then explore the User Guide for in-depth information organized by topic.

1.1 Copyright

Faust User Manual

Copyright © 2017-2019, Robinhood Markets, Inc. All rights reserved.

This material may be copied or distributed only subject to the terms and conditions set forth in the Creative Commons Attribution-ShareAlike 4.0 International license (http://creativecommons.org/licenses/by-sa/4.0/legalcode).

You may share and adapt the material, even for commercial purposes, but you must give the original author credit. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same license or a license compatible to this one.

Note: While the Faust documentation is offered under the Creative Commons Attribution-ShareAlike 4.0 International license, the Faust software is offered under the BSD License (3 Clause).

1.2 Introducing Faust

Version 1.7.4
Web http://faust.readthedocs.io/
Download http://pypi.org/project/faust
Source http://github.com/robinhood/faust
Keywords distributed, stream, async, processing, data, queue

Table of Contents

• What can it do?
• How do I use it?
• What do I need?
• Extensions
• Design considerations
• Getting Help
• Resources
• License

1.2.1 What can it do?

Agents

Process infinite streams in a straightforward manner using asynchronous generators. The concept of "agents" comes from the actor model, and means the stream processor can execute concurrently on many CPU cores, and on hundreds of machines at the same time.

Use regular Python syntax to process streams and reuse your favorite libraries:

    @app.agent()
    async def process(stream):
        async for value in stream:
            print(value)  # handle each event with plain Python

Tables

Tables are sharded dictionaries that enable stream processors to be stateful, with persistent and durable data.

Streams are partitioned to keep relevant data close, and can be easily repartitioned to achieve the topology you need.

In this example we repartition an order stream by account id, to count orders in a distributed table:

    import faust

    # this model describes how message values are serialized
    # in the Kafka "orders" topic.
    class Order(faust.Record, serializer='json'):
        account_id: str
        product_id: str
        amount: int
        price: float

    app = faust.App('hello-app', broker='kafka://localhost')
    orders_kafka_topic = app.topic('orders', value_type=Order)

    # our table is sharded amongst worker instances, and replicated
    # with standby copies to take over if one of the nodes fails.
    order_count_by_account = app.Table('order_count', default=int)

    @app.agent(orders_kafka_topic)
    async def process(orders: faust.Stream[Order]) -> None:
        async for order in orders.group_by(Order.account_id):
            order_count_by_account[order.account_id] += 1

If we start multiple instances of this Faust application on many machines, any order with the same account id will be received by the same stream processing agent, so the count updates correctly in the table.
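For example, assuming the code above is saved in a module named hello_app.py (the file name is just for illustration), each instance runs as a separate worker started from the faust command line:

    $ faust -A hello_app worker -l info
    $ faust -A hello_app worker -l info --web-port 6067  # second instance on the same host

Kafka then distributes the topic's partitions among the running workers, so all orders for a given account id consistently arrive at one instance.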
Sharding/partitioning is an essential part of stateful stream processing applications, so take this into account when designing your system. Note, however, that streams can also be processed in round-robin order, so you can use Faust for event processing and as a task queue as well.

Asynchronous with asyncio

Faust takes full advantage of asyncio and the new async/await keywords in Python 3.6+ to run multiple stream processors in the same process, along with web servers and other network services.

Thanks to Faust and asyncio you can now embed your stream processing topology into your existing asyncio/gevent/eventlet/Twisted/Tornado applications (a sketch of starting a worker programmatically follows at the end of this section).

Faust is…

Simple

Faust is extremely easy to use. To get started with other stream processing solutions you face complicated hello-world projects and infrastructure requirements; Faust only requires Kafka, the rest is just Python. If you know Python you can already use Faust to do stream processing, and it can integrate with just about anything.

Here's one of the easier applications you can make:

    import faust

    class Greeting(faust.Record):
        from_name: str
        to_name: str

    app = faust.App('hello-app', broker='kafka://localhost')
    topic = app.topic('hello-topic', value_type=Greeting)

    @app.agent(topic)
    async def hello(greetings):
        async for greeting in greetings:
            print(f'Hello from {greeting.from_name} to {greeting.to_name}')

    @app.timer(interval=1.0)
    async def example_sender(app):
        await hello.send(
            value=Greeting(from_name='Faust', to_name='you'),
        )

    if __name__ == '__main__':
        app.main()

You're probably a bit intimidated by the async and await keywords, but you don't have to know how asyncio works to use Faust: just mimic the examples, and you'll be fine.

The example application starts two tasks: one is processing a stream, the other is a background task sending events to that stream. In a real-life application, your system will publish events to Kafka topics that your processors can consume from, and the background task is only needed to feed data into our example.

Highly Available

Faust is highly available and can survive network problems and server crashes. In the case of node failure, it can automatically recover, and tables have standby nodes that will take over.

Distributed

Start more instances of your application as needed.

Fast

A single-core Faust worker instance can already process tens of thousands of events every second, and we are reasonably confident that throughput will increase once we can support a more optimized Kafka client.
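As a sketch of the embedding mentioned above, a worker can also be started programmatically from your own code instead of through the faust command line; the app name and loglevel value below are placeholders:

    import faust

    app = faust.App('embedded-app', broker='kafka://localhost')

    if __name__ == '__main__':
        # The Worker owns the asyncio event loop: it starts the app's
        # agents, timers and web server, and blocks until shutdown.
        worker = faust.Worker(app, loglevel='info')
        worker.execute_from_commandline()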