Biweekly Reading 0xC

2016-10-30Home

Last week, I made use of my three-hour shuttle time watching Reactive Summit 2016. Since YouTube wouldn't allow me to cache videos so I downloaded them to my laptop with a "kiss".

Reactive Summit

Distributed stream processing with Apache Kafka

My first pick is Jay Kreps' keynote on Distributed stream processing with Apache Kafka. His articles have never failed me and this talk is no exception. He firstly argued there is an intersection between streaming processing and micro-services, which his talk is about. Then he categorized computer programming into 3 paradigms based on input / output.

  1. Request / Response. One Input and one output, and only process future input.
  2. Batch. A batch of inputs and outputs, and only process past input.
  3. Stream Processing. Generalization of 1 and 2.

I couldn't agree more that stream processing isn't necessarily transient, approximate and lossy. After introducing the challenges in stream processing and micro-services, he spent rest of the time on how to solve the hard parts with Kafka, Kafka Streams, Kafka Connect and Confluent platform.

I really enjoy Jay's keynote, especially the first half.

bla bla microservices bla bla: Director’s Cut

Akka inventor Jonas Boner distilled into micro-services and how to build distributed systems on it. My most experiences with micro-services come from Akka and Gearpump, where different system roles (actors) work asynchronously and communicate only through messages. I have no idea how micro-services work as a backend across groups in a big company. Hence, Jonas' talk, full of good quotes, looks more like a bird view to me. Nevertheless, I do remember he said

There is no such thing as a "stateless" architecture. It's just someone else's problem.

I believe state management and "stateful" API is a must-have for a stream processing system.

Scala and the JVM as a Big Data Platform - Lessons from Apache Spark

Dean Wampler shared how easy it is to develop distributed applications with Spark. Meanwhile, the JVM has significant GC problems and Spark is fixing it with project "Tungsten". What caught my eyes is the OOM issue caused by copy of 2.2GB array. The codes below, although written in Scala REPL, will be compiled into a JVM class. Then the instance, closure over b, will be serialized and shipped to remote cluster with all of its fields including the 2.2GB array.

scala> val N = 1100 * 1000 * 1000
scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until 100000).
     | map(i => b.value(i))

The solution is to either mark the array as @transient or put it into a singleton, a companion object. Please check the slides for more details.

The Zen of Erlang

Akka brings Erlang's actor model to JVM. The Zen of Erlang, by Fred Hebert, is almost the design patterns of Akka except for preemptive scheduling. We can't do blocking stuff in actor if a shared-thread-pool dispatcher is used. Otherwise, the whole system will be blocked.

Built on Akka, Gearpump employs Akka's supervision tree for fault tolerance. The supervisor will restart an application on task failures. Why restarting works ? According to Fred, there are such thing as transient bugs which are hard to find in development but happen in production all the time. Restarting heals it.

That's my impression on Reactive Summit so far. I'll share more as I read on. Let's turn our attention towards the Big Data community.

Releases

Big Data Systems

Software

Think about it.