Streaming at Spark Summit East 2016

2016-03-15

Spark Summit East 2016 was held in NYC last month. Databricks has already published a look back, so I'm going to focus on the (Spark) streaming part here.

Highlights

The most interesting talks were:

Spark Streaming and IoT

Mike Freedman, CEO and Co-Founder of iobeam, mainly talked about the challenges in applying Spark to IoT.

[Image: challenges_applying_spark_iot]

I like this talk because these challenges are quite general. It's unclear, though, how iobeam solved them with Spark Streaming, which only supports processing by data arrival time rather than event time. iobeam is a data analysis platform designed for IoT. I really like their website, which puts code side by side with use cases.
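To make the arrival-time limitation concrete, here is a minimal sketch in plain Python (not Spark, and not iobeam's actual solution). The field names, the 60-second window, and the sample readings are all made up for illustration. It counts the same events per tumbling window twice, once keyed by arrival time and once by event time, assuming a device that buffered two readings while offline and uploaded them late.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical tumbling window size


def window_start(ts, size=WINDOW_SECONDS):
    """Return the start of the tumbling window that contains timestamp ts."""
    return ts - (ts % size)


def count_by_window(events, time_field):
    """Count events per tumbling window, keyed by the chosen timestamp field."""
    counts = defaultdict(int)
    for event in events:
        counts[window_start(event[time_field])] += 1
    return dict(counts)


# Hypothetical sensor readings (timestamps in seconds). Two readings were
# buffered on the device and uploaded late, so arrival_time lags event_time.
events = [
    {"device": "a", "event_time": 5, "arrival_time": 6},
    {"device": "a", "event_time": 30, "arrival_time": 125},  # delayed upload
    {"device": "a", "event_time": 50, "arrival_time": 126},  # delayed upload
    {"device": "a", "event_time": 70, "arrival_time": 71},
]

by_arrival = count_by_window(events, "arrival_time")
by_event = count_by_window(events, "event_time")

print("by arrival time:", by_arrival)
print("by event time:  ", by_event)
```

Grouping by arrival time puts the two delayed readings into the window starting at 120, although they were measured during the first minute. Grouping by event time puts them back where they belong, which is what an IoT workload with flaky connectivity usually needs.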

Online Security Analytics on Large Scale Video Surveillance System

This is from EMC Video Analytics Data Lake, where Spark Streaming is used for online video processing and detection.

[Image: online_video_processing]

The streaming application feeds offline model training, and the trained models are in turn used for realtime detection.

Clickstream Analysis with Spark—Understanding Visitors in Realtime

The talk is really about the evolution of the architecture from "Larry & Friends" (Oracle) to "Hadoop & Friends" (HDFS, Hive), from Kappa-Architecture to Lambda-Architecture, and finally to Mu-Architecture, all based on Spark.

[Image: connection_streaming_batch]

Note that "realtime" here means 15 minutes, so a low-latency streaming engine like Storm would be overengineering. It is, however, a sweet spot for Spark Streaming, given that the other components in the system are also based on Spark.

Core

Spark 2.0 will add an infinite DataFrames API for Spark Streaming, unified with the existing DataFrames API for batch processing. Event-time aggregations will finally arrive in Spark Streaming.
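The idea behind an "infinite" DataFrame can be sketched without Spark: the same aggregation is written once and run either over a complete table or incrementally as rows arrive, with the same answer. The snippet below is plain Python under that assumption. It is not the Spark 2.0 API, and `sum_by_key`, the sample rows, and the micro-batch boundaries are hypothetical.

```python
from collections import defaultdict


def sum_by_key(rows, state=None):
    """Add the values in rows to per-key running totals and return the totals.

    With state=None this is an ordinary batch aggregation; passing the
    previous totals back in turns it into an incremental (streaming) update.
    """
    totals = defaultdict(int) if state is None else state
    for key, value in rows:
        totals[key] += value
    return totals


# Hypothetical table of (key, value) rows.
table = [
    ("page_a", 1),
    ("page_b", 2),
    ("page_a", 3),
    ("page_c", 4),
    ("page_b", 5),
]

# Batch: aggregate the whole table in one call.
batch_result = dict(sum_by_key(table))

# Streaming: the same rows arrive as three micro-batches, and the running
# totals are carried from one micro-batch to the next.
state = None
for micro_batch in (table[:2], table[2:4], table[4:]):
    state = sum_by_key(micro_batch, state)
stream_result = dict(state)

print(batch_result == stream_result)
```

The point of the design is that the query logic lives in one place, and only the way rows are fed to it differs between batch and streaming.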