Weekly Reading 0x7

2016-06-19Home

So long since last time that the "Weekly Reading" is almost becoming "Monthly Reading". It is a (bad) sign that I haven't read much these days. One thing I've been up to is the upcoming Shanghai BigData Streaming 3rd Meetup. Let's chat on Big Data Streaming and European Cups over a cup of beer. Another is the Gearpump Runner for Apache Beam. The reading list is also of good contents on Streaming and Apache Beam / Google Cloud Dataflow.

Streaming

We firstly introduced Apache Beam in week 4 and then looked at Google's and dataArtisans's perspectives. Future-proof and scale-proof your code gives another two reasons to use Apache Beam.
- Future-proofing
  
  Future-proofing code means that we’ll be able to run it on new technologies as they come out, without having to re-write the code.
- Scale-proofing
  
  As data grows, “scale-proofing” code means that we can start out with small data, and have an API that grows with us.
I think, with Apache Beam, it's cheap to try out another platform and carry out benchmarking. Plus, there will be no difference between writing codes run on single machine and cloud.

This is how wordcount looks like with Beam API. Each apply function takes a PTransform parameter, like ParDo and Count.

Pipeline p = Pipeline.create(options);
p.apply(ParDo.named("ExtractWords").of(new DoFn<String, String>() {
                    @Override
                    public void processElement(ProcessContext c) {
                      for (String word : c.element().split("[^a-zA-Z']+")) {
                        if (!word.isEmpty()) {
                          c.output(word);
                        }
                      }
                    }
                  }))
 .apply(Count.<String>perElement())                 
 .apply(TextIO.Write.to("gs://YOUR_OUTPUT_BUCKET/AND_OUTPUT_PREFIX"));
p.run();

Beam is inspired by FlumeJava but it has replaced FlumeJava's methods on PCollection with the ubiquitous PTransform. Where's my PCollection.map()? looks at the history and design decisions behind this.

Besides Apache Beam, Google Cloud Dataflow offers two advanced features
- Autoscaling dynamically adjusts the number of workers to the needs of pipeline. It is especially useful for streaming application where input data rate varies over time.
- Liquid Sharding addresses the problems of stragglers through asking busy workers to give away unprocessed work to free workers.
Reactive Kafka is an Akka Streams connector for Apache Kafka. Check out Reducing Microservice Complexity with Kafka and Reactive Streams for examples.

Scala

Li Haoyi demonstrates how to micro-optimize your Scala code, which is full of useful techniques.

Beyond

How to solve hard programing problems ? Julia Evans comes up with three ways.
- Realize that there is awesome existing software that you can repurpose
- Steal an idea
- Come up with a new idea
Intel x86s hide another CPU that can take over your machine (you can't audit it). Take it with a grain of salt.

That's all for this week. See you in the meetup.

ManuZhang's Blog

Static Blog generated in Scala

Weekly Reading 0x7

Streaming

Scala

Beyond