2017-04-27
Flink Forward, "The premier conference on Apache Flink®", just took place in San Francisco. All the slides and videos are available now. The conference was rich in both practical experience and technical detail. After going through all the slides, I'd like to share some interesting content that you can't find on http://flink.apache.org/. (I wasn't there, nor have I watched all the videos, so take my reading with a grain of salt.)
Update: data Artisans has published an official recap: On the State of Stream Processing with Apache Flink
-"What is so hot?"
-"Deep learning"
-"What is so hot in deep learning?"
-"TensorFlow"
TensorFlow & Apache Flink immediately caught my eye. The basic idea is "TF Graph as a Flink map function": preprocess data into off-heap tensors, then run inference inside the map function. Online learning is a future direction, and the project is open-sourced on GitHub.
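The shape of that idea can be sketched without any Flink or TensorFlow dependencies. Everything below is a hypothetical stand-in, not the project's actual API: a trivial linear `Model` plays the role of a trained TF graph, and a plain Java stream plays the role of the DataStream; the point is only the preprocess-then-infer structure of the map function.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ModelMapSketch {
    // Stand-in for a serialized, already-trained model (e.g. a TF graph).
    interface Model { double predict(double[] features); }

    // The "map function": preprocess a record into a tensor-like array,
    // then run inference on it.
    static double score(Model model, String csvRecord) {
        double[] tensor = Arrays.stream(csvRecord.split(","))
                .mapToDouble(Double::parseDouble)
                .toArray();                       // preprocessing step
        return model.predict(tensor);             // inference step
    }

    public static void main(String[] args) {
        // Pretend this linear model was trained offline.
        Model linear = f -> 0.5 * f[0] + 0.25 * f[1];
        List<String> stream = List.of("2.0,4.0", "1.0,0.0");
        List<Double> scores = stream.stream()
                .map(r -> score(linear, r))
                .collect(Collectors.toList());
        System.out.println(scores);  // [2.0, 0.5]
    }
}
```

In the real setup the model bytes would be loaded once per task in the map function's open phase, so each record only pays for inference, not deserialization.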
Lightbend's Dean Wampler discussed how to do Deep Learning with Flink in general, from the challenges in (mini-batch / distributed / online) training and inference to practical approaches, leveraging Flink features such as side inputs and async I/O. He also recommended "Do"s and "Don't"s for the road ahead. One interesting "Don't" is PMML.
PMML - doesn't work well enough. Not really that useful?
At least ING uses PMML models to bridge offline training with Spark and online scoring with Flink streaming. The striking part is how they've decoupled the "What" (the DAG) from the "How" (the behavior of each node on the DAG) in the scoring application. Model definitions are added at runtime through a broadcast stream without downtime. The same applies to data persistence and feature extraction.
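The hot-swap part can be illustrated with a minimal sketch. All names here (`behaviors`, `onControlMessage`, `onEvent`, the node names) are hypothetical: a registry maps node names on the fixed DAG ("What") to swappable functions ("How"), and a control message on the broadcast side replaces a node's behavior while data events keep flowing.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class HotSwapScoring {
    // Node behaviors, keyed by node name; updated by control messages.
    static final Map<String, Function<Double, Double>> behaviors = new ConcurrentHashMap<>();

    // Broadcast-side handler: install or replace a node's behavior at runtime.
    static void onControlMessage(String node, Function<Double, Double> model) {
        behaviors.put(node, model);
    }

    // Data-side handler: run the fixed DAG (extract -> score) using
    // whatever behaviors are currently installed.
    static double onEvent(double raw) {
        double features = behaviors.get("extract").apply(raw);
        return behaviors.get("score").apply(features);
    }

    public static void main(String[] args) {
        onControlMessage("extract", x -> x / 10.0);           // initial model definition
        onControlMessage("score", x -> x > 0.5 ? 1.0 : 0.0);
        System.out.println(onEvent(7.0));                     // 1.0
        onControlMessage("score", x -> x > 0.9 ? 1.0 : 0.0);  // hot swap, no downtime
        System.out.println(onEvent(7.0));                     // 0.0
    }
}
```

In Flink proper the registry would live in broadcast state rather than a static map, so the swap is consistent across parallel instances, but the decoupling is the same.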
By the way, Apache Storm has added PMML support in 1.1.0.
-"Which machine learning (ML) framework is the best ?" -"All of them"
That's Ted Dunning's answer after learning that his customers typically use 12 ML packages, and the smallest number he has seen is 5. That's why he didn't talk about Flink ML in Machine Learning on Flink. ML itself is not even the key here.
90%+ of effort is logistics, not learning.
It is the data. Record raw data, use streams to keep data around, and measure and understand (with meta-data) everything. Another thing is to make deployment easier with containerization.
One more lesson for me is that there is no such thing as one model.
You will have dozens of models, likely hundreds to thousands.
Blink is Alibaba's fork of Flink. For large-scale streaming (> 1000 nodes) at Alibaba, Blink added a bunch of runtime improvements (in the slides, the right side of "=>" is the problem each improvement solves).
What's cool is that the improvements are being contributed back to the community.
Uber shared the evolution of their business needs and how their system evolved accordingly, from event processing to OLAP to Streaming SQL on Flink.
70-80% of jobs can be implemented via SQL.
For more on Flink's SQL API, check out Table & SQL API – unified APIs for batch and stream processing and Blink's Improvements to Flink SQL And TableAPI.
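To see why so many jobs reduce to SQL, it helps to look at what a continuous query computes. The sketch below is hypothetical (the `Clicks` table, `run` method, and data are all made up): it simulates, over an in-memory "stream", the running per-key result that a streaming engine would maintain for a query like `SELECT user, COUNT(*) FROM Clicks GROUP BY user` — every incoming event updates the result table rather than producing a one-shot answer.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ContinuousCountSketch {
    // Computes the final state of a continuously updated per-user count.
    public static Map<String, Long> run(List<String> clicks) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (String user : clicks) {
            result.merge(user, 1L, Long::sum); // each event updates the running result
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("alice", "bob", "alice")));
        // {alice=2, bob=1}
    }
}
```

The appeal is that the aggregation logic lives entirely in the declarative query; the engine, not the application, owns the state and its updates.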
Apache Beam is one of Uber's future directions. From the official site:
Apache Beam provides an advanced unified programming model, allowing you to implement batch and streaming data processing jobs that can run on any execution engine.
Beam has recently added State and Timer support to unlock new use cases which are portable across runners (e.g. Flink).
I'd like to wrap up with Stephan Ewen's answer and high-level view of streaming.
2016 was the year when streaming technologies became mainstream
2017 is the year to realize the full spectrum of streaming applications