Biweekly Reading 0xC

2016-10-30Home

Last week, I made use of my three-hour shuttle time watching Reactive Summit 2016. Since YouTube wouldn't allow me to cache videos so I downloaded them to my laptop with a "kiss".

Reactive Summit

Distributed stream processing with Apache Kafka

My first pick is Jay Kreps' keynote on Distributed stream processing with Apache Kafka. His articles have never failed me and this talk is no exception. He firstly argued there is an intersection between streaming processing and micro-services, which his talk is about. Then he categorized computer programming into 3 paradigms based on input / output.

Request / Response. One Input and one output, and only process future input.
Batch. A batch of inputs and outputs, and only process past input.
Stream Processing. Generalization of 1 and 2.

I couldn't agree more that stream processing isn't necessarily transient, approximate and lossy. After introducing the challenges in stream processing and micro-services, he spent rest of the time on how to solve the hard parts with Kafka, Kafka Streams, Kafka Connect and Confluent platform.

I really enjoy Jay's keynote, especially the first half.

bla bla microservices bla bla: Director’s Cut

Akka inventor Jonas Boner distilled into micro-services and how to build distributed systems on it. My most experiences with micro-services come from Akka and Gearpump, where different system roles (actors) work asynchronously and communicate only through messages. I have no idea how micro-services work as a backend across groups in a big company. Hence, Jonas' talk, full of good quotes, looks more like a bird view to me. Nevertheless, I do remember he said

There is no such thing as a "stateless" architecture. It's just someone else's problem.

I believe state management and "stateful" API is a must-have for a stream processing system.

Scala and the JVM as a Big Data Platform - Lessons from Apache Spark

Dean Wampler shared how easy it is to develop distributed applications with Spark. Meanwhile, the JVM has significant GC problems and Spark is fixing it with project "Tungsten". What caught my eyes is the OOM issue caused by copy of 2.2GB array. The codes below, although written in Scala REPL, will be compiled into a JVM class. Then the instance, closure over b, will be serialized and shipped to remote cluster with all of its fields including the 2.2GB array.

scala> val N = 1100 * 1000 * 1000
scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until 100000).
     | map(i => b.value(i))

The solution is to either mark the array as @transient or put it into a singleton, a companion object. Please check the slides for more details.

The Zen of Erlang

Akka brings Erlang's actor model to JVM. The Zen of Erlang, by Fred Hebert, is almost the design patterns of Akka except for preemptive scheduling. We can't do blocking stuff in actor if a shared-thread-pool dispatcher is used. Otherwise, the whole system will be blocked.

Built on Akka, Gearpump employs Akka's supervision tree for fault tolerance. The supervisor will restart an application on task failures. Why restarting works ? According to Fred, there are such thing as transient bugs which are hard to find in development but happen in production all the time. Restarting heals it.

That's my impression on Reactive Summit so far. I'll share more as I read on. Let's turn our attention towards the Big Data community.

Releases

Apache Kafka 0.10.1.0 released including completion of 15 KIPs, over 200 bug fixes and improvements. One notable feature is queryable states for Kafka Streams.
sbt 0.13.13 is available deprecating old sbt 0.12 DSL, <<=, <+=, <++= and tuple enrichments.

Big Data Systems

Blink: How Alibaba Uses Apache Flink introduces Flink's usage at Alibaba search and how it is adapted to meet their unique requirements. The good news is Alibaba are contributing the improvements back to community and transiting to vanilla Flink.
Facebook did a comparison between Giraph and GraphX. The key takeaway is Giraph has larger scale and better performance while GraphX is easier to program.
Databricks is adding support for Spark clusters with GPU to accelerate deep learning workloads. Spark already has integrations with Tensorflow and Caffe.

Software

What's wrong with Git ? A conceptual design analysis. Ever confused with Git's staging concept ? Felt cumbersome to stash or commit unfinished changes before switching branches. The authors show how the conceptual model of Git lead to those difficulties and simplify it in Gitless.
How to code with good taste ? Applying the Linus Torvalds “Good Taste” Coding Requirement looks at Linus' taste and conclude

What I think Linus meant, and what developers who have “good taste” do differently, is that they take the time to conceptualize what they are building before they start.

Think about it.

ManuZhang's Blog

Static Blog generated in Scala