This talk presents a continuous application example that relies on the Spark FAIR scheduler as the conductor to orchestrate an entire “lambda architecture” in a single Spark context. As in a typical time series event stream analysis, there are four key components:
– an ETL step to store the raw data
– a series of real-time aggregations on the join of streaming input and historical data to power a model
– model execution
– ad-hoc queries for human inspection.
Compared with a typical design in which several Spark applications run separately, the key benefits of this setup are:
1. Decouple streaming batch processing from model calculation: model calculations are triggered at a different pace from the stream processing.
2. The model always processes the latest data, using pure RDD APIs.
3. Launch various operations in different threads on the driver node, ensuring that each is submitted to the appropriate fair scheduler pool, and let the FAIR scheduler handle resource distribution.
4. Share code and save time by sharing the actual data transformations (such as the RDDs in the intermediate steps).
5. Support ad-hoc queries on intermediate state without a dedicated serving layer or output protocol.
6. Only one app to monitor and tune.
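The thread-per-operation pattern in benefit 3 can be sketched as follows. This is a minimal illustration in Python, not the speaker's implementation. It assumes a SparkContext `sc` created with `spark.scheduler.mode` set to `FAIR`; the helper names `run_in_pool` and `launch` and any pool names are hypothetical. `setLocalProperty` with the `spark.scheduler.pool` key is the standard Spark way to route a thread's jobs to a pool. In PySpark, per-thread properties depend on pinned thread mode being enabled, which is assumed here.

```python
import threading

# Real Spark local property that selects the fair scheduler pool
# for jobs submitted from the current thread.
POOL_PROPERTY = "spark.scheduler.pool"


def run_in_pool(sc, pool_name, job):
    """Run job(sc) on the current thread with its Spark jobs routed to pool_name.

    `sc` is expected to be a SparkContext (assumption: FAIR mode is enabled).
    """
    sc.setLocalProperty(POOL_PROPERTY, pool_name)
    try:
        return job(sc)
    finally:
        # Clear the property so a reused thread falls back to the default pool.
        sc.setLocalProperty(POOL_PROPERTY, None)


def launch(sc, pool_name, job):
    """Start job(sc) on its own driver thread (hypothetical helper).

    Returns the thread and a dict that will hold the result under "value".
    """
    results = {}

    def target():
        results["value"] = run_in_pool(sc, pool_name, job)

    thread = threading.Thread(target=target, name=f"{pool_name}-worker", daemon=True)
    thread.start()
    return thread, results
```

A driver could then call, for example, `launch(sc, "etl", store_raw_events)` and `launch(sc, "adhoc", run_query)`, where both job functions and pool names are placeholders. Each operation runs at its own pace while the FAIR scheduler divides cluster resources between the pools.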
Session hashtag: #SFdev17
Robert has a hybrid background in Stats, Econ and CS. He enjoys making decisions, especially through the systems he has built. During his three years at Groupon, he has used Spark to support a production low-latency anti-hacker system. To make the system faster and more robust, he has experimented with multiple “hidden” usage patterns that are not yet well known to most users, and has written tools to put these ideas into practice. One of his tools, Sparklint, was open sourced during Spark Summit EU 2016.