2016-05-08
I'd like to start this week with something other than code: how to present code. There are a few basic rules to follow, the most basic of which is: remember, your slides are not your IDE.
Now let's get down to code.
Unlike the batch world, where Spark "rules" (Tez, Flink, anyone?), the streaming world has entered a war era: there are many streaming solutions, each with its own pros and cons. Need an apples-to-apples comparison? An Overview of Apache Streaming Technologies is for you.
Although listed in the previous comparison, Apache Beam is not yet another streaming technology; rather, it aims to
provide the world with an easy-to-use, but powerful model for data-parallel processing, both streaming and batch, portable across a variety of runtime platforms
It was formerly the Google Cloud Dataflow SDK and requires a runner (e.g. Google Cloud Dataflow, Flink, Spark) to work. Why Apache Beam? A Google Perspective explains why open-sourcing the project makes sense for Google, from the business perspective too.
That motivation hinges primarily on the desire to get as many Apache Beam pipelines as possible running on Cloud Dataflow.
Beam also has nice features such as auto-scaling; Streaming Auto-scaling in Google Cloud Dataflow has more details.
Apache Kafka has a data structure called the purgatory, which holds any request that hasn't yet met its criteria to succeed but also hasn't yet resulted in an error. Apache Kafka, Purgatory, and Hierarchical Timing Wheels talks about how Kafka efficiently keeps track of the tens of thousands of requests that are being asynchronously satisfied by other activity in the cluster. The hierarchical timing wheel is really a great data structure to know.
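To give a feel for the idea, here is a minimal, single-threaded sketch of a hierarchical timing wheel. All names are my own and this is far simpler than Kafka's actual implementation: a wheel holds `wheel_size` buckets of `tick_ms` each, and timers that fall beyond the wheel's span overflow into a lazily created coarser parent wheel, cascading back down as time advances.

```python
class TimingWheel:
    """A toy hierarchical timing wheel (not Kafka's real code).

    Buckets cover `tick_ms` each; timers beyond `tick_ms * wheel_size`
    go into a coarser parent wheel whose tick equals this wheel's span.
    """

    def __init__(self, tick_ms, wheel_size, start_ms=0):
        self.tick_ms = tick_ms
        self.wheel_size = wheel_size
        self.interval = tick_ms * wheel_size           # span of this wheel
        self.current = start_ms - start_ms % tick_ms   # clock, tick-aligned
        self.buckets = [[] for _ in range(wheel_size)]
        self.overflow = None                           # coarser parent wheel

    def add(self, expiration_ms, task):
        """Schedule `task`; returns False if it is already due (run it now)."""
        if expiration_ms < self.current + self.tick_ms:
            return False
        if expiration_ms < self.current + self.interval:
            idx = (expiration_ms // self.tick_ms) % self.wheel_size
            self.buckets[idx].append((expiration_ms, task))
            return True
        # Too far out for this wheel: delegate to the coarser parent.
        if self.overflow is None:
            self.overflow = TimingWheel(self.interval, self.wheel_size,
                                        self.current)
        return self.overflow.add(expiration_ms, task)

    def advance(self, now_ms):
        """Advance the clock to `now_ms` and return the expired tasks."""
        return [task for _, task in self._advance(now_ms)]

    def _advance(self, now_ms):
        expired = []
        while self.current + self.tick_ms <= now_ms:
            self.current += self.tick_ms
            idx = (self.current // self.tick_ms) % self.wheel_size
            expired.extend(self.buckets[idx])
            self.buckets[idx] = []
        if self.overflow is not None:
            # Entries falling out of the coarser wheel either expire now or
            # cascade back down into this wheel's finer-grained buckets.
            for exp_ms, task in self.overflow._advance(now_ms):
                if exp_ms < self.current + self.tick_ms:
                    expired.append((exp_ms, task))
                else:
                    self.add(exp_ms, task)
        return expired


wheel = TimingWheel(tick_ms=1, wheel_size=8)
wheel.add(3, "req-a")
wheel.add(20, "req-b")    # beyond 8 ticks: spills into the overflow wheel
print(wheel.advance(5))   # ['req-a']
print(wheel.advance(20))  # ['req-b'] (cascaded back down, then expired)
```

The payoff is that insertion and per-tick expiry are O(1) amortized, versus O(log n) per operation for a heap-based delay queue, which matters when tens of thousands of requests sit in purgatory at once.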
Storm 1.0.1 has been released. It's a maintenance release that includes a number of important bug fixes improving Storm's performance, stability, and fault tolerance.
"Does the Database Community Have an Identity Crisis?"
Think about it. I'll leave you here.