Weekly Reading 0x7

2016-06-19

It has been so long since the last issue that this "Weekly Reading" is almost becoming a "Monthly Reading". That's a (bad) sign that I haven't been reading much lately. One thing I've been up to is the upcoming Shanghai BigData Streaming 3rd Meetup; come chat about Big Data Streaming and the European Cup over a cup of beer. Another is the Gearpump Runner for Apache Beam. Fittingly, this issue's reading list is mostly about Streaming and Apache Beam / Google Cloud Dataflow.

Streaming

  • We first introduced Apache Beam in week 4 and then looked at Google's and data Artisans' perspectives. Future-proof and scale-proof your code gives two more reasons to use Apache Beam.

    • Future-proofing

      Future-proofing code means that we’ll be able to run it on new technologies as they come out, without having to re-write the code.

    • Scale-proofing

      As data grows, “scale-proofing” code means that we can start out with small data, and have an API that grows with us.

    I think Apache Beam makes it cheap to try out another platform and run benchmarks against it. Plus, the code you write for a single machine is the same code you run in the cloud.
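
    To make "future-proofing" concrete: in Beam the runner is chosen entirely through pipeline options, so the same code can move from a laptop to Flink, Spark or Cloud Dataflow without edits. A minimal sketch, assuming the 2016 incubating org.apache.beam.sdk API (runner class names have shifted across releases):

    // The transforms never change; only the --runner flag passed on the
    // command line does, e.g. --runner=FlinkPipelineRunner.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);
    // ... attach the same transforms as the wordcount below ...
    p.run();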

  • This is what wordcount looks like with the Beam API. Each apply call takes a PTransform, such as ParDo and Count.

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.from("gs://YOUR_INPUT_BUCKET/AND_INPUT_PREFIX"))
     // Split each input line into individual words.
     .apply(ParDo.named("ExtractWords").of(new DoFn<String, String>() {
       @Override
       public void processElement(ProcessContext c) {
         for (String word : c.element().split("[^a-zA-Z']+")) {
           if (!word.isEmpty()) {
             c.output(word);
           }
         }
       }
     }))
     // Count the occurrences of each distinct word.
     .apply(Count.<String>perElement())
     // Format each KV<word, count> as a line of text so TextIO can write it.
     .apply(ParDo.named("FormatResults").of(new DoFn<KV<String, Long>, String>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("gs://YOUR_OUTPUT_BUCKET/AND_OUTPUT_PREFIX"));
    p.run();
    

    Beam is inspired by FlumeJava, but it replaces FlumeJava's methods on PCollection with the ubiquitous PTransform. Where's my PCollection.map()? looks at the history and design decisions behind this choice.
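
    One payoff of the PTransform design is composability: a chain of transforms can itself be packaged as a named PTransform. A hedged sketch, assuming the 2016-era API where composites override apply (later renamed to expand); CountWords and ExtractWordsFn are hypothetical names here:

    // Bundles the ExtractWords ParDo and Count from the wordcount above
    // into one reusable, named composite transform.
    public static class CountWords
        extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
      @Override
      public PCollection<KV<String, Long>> apply(PCollection<String> lines) {
        return lines
            // ExtractWordsFn: the anonymous DoFn above, factored into a class.
            .apply(ParDo.named("ExtractWords").of(new ExtractWordsFn()))
            .apply(Count.<String>perElement());
      }
    }

    With that, the middle of the wordcount collapses to .apply(new CountWords()).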

  • Besides Apache Beam itself, Google Cloud Dataflow offers two advanced features:

    • Autoscaling dynamically adjusts the number of workers to the needs of the pipeline. It is especially useful for streaming applications, where the input data rate varies over time.
    • Liquid Sharding addresses the straggler problem by having busy workers hand unprocessed work off to idle workers.
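
    Autoscaling is controlled through pipeline options. A hedged sketch against the 2016-era Dataflow SDK option interfaces (enum and setter names may differ in later releases); Liquid Sharding, by contrast, is service-side behavior rather than a user-set option:

    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    // Scale the worker pool with observed throughput, up to maxNumWorkers.
    options.setAutoscalingAlgorithm(
        DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType.THROUGHPUT_BASED);
    options.setMaxNumWorkers(20); // hypothetical ceiling for this sketch
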
  • Reactive Kafka is an Akka Streams connector for Apache Kafka. Check out Reducing Microservice Complexity with Kafka and Reactive Streams for examples.


That's all for this week. See you at the meetup.