2016-06-19
It has been so long since the last post that "Weekly Reading" is almost becoming "Monthly Reading". It is a (bad) sign that I haven't read much these days. One thing I've been busy with is the upcoming Shanghai BigData Streaming 3rd Meetup. Let's chat about big data streaming and the European Cup over a beer. Another is the Gearpump Runner for Apache Beam. This week's reading list is also mostly about streaming and Apache Beam / Google Cloud Dataflow.
We first introduced Apache Beam in week 4 and then looked at Google's and data Artisans' perspectives. "Future-proof and scale-proof your code" gives two more reasons to use Apache Beam.
Future-proofing
Future-proofing code means that we’ll be able to run it on new technologies as they come out, without having to re-write the code.
Scale-proofing
As data grows, “scale-proofing” code means that we can start out with small data, and have an API that grows with us.
I think that with Apache Beam it is cheap to try out another platform and benchmark it. Plus, there is no difference between writing code that runs on a single machine and code that runs in the cloud.
This is what word count looks like with the Beam API. Each apply method takes a PTransform parameter, such as ParDo or Count.
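Pipeline p = Pipeline.create(options);
// Note: this excerpt omits the source. A runnable pipeline first reads
// lines of text (for example with TextIO.Read) before ExtractWords.
p.apply(ParDo.named("ExtractWords").of(new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext c) {
        // Split each line on anything that is not a letter or an apostrophe.
        for (String word : c.element().split("[^a-zA-Z']+")) {
          if (!word.isEmpty()) {
            c.output(word);
          }
        }
      }
    }))
    // Count occurrences of each word. The result is KV<String, Long> pairs,
    // which need formatting as strings before a text sink can write them.
    .apply(Count.<String>perElement())
    .apply(TextIO.Write.to("gs://YOUR_OUTPUT_BUCKET/AND_OUTPUT_PREFIX"));
p.run();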
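To make it clearer what the ExtractWords and Count steps compute, here is a minimal sketch of the same logic in plain Java, with no Beam dependency. The class name WordCountSketch and the method countWords are my own names for illustration, not part of any Beam API, and everything runs in memory on one machine.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Beam-free illustration of what ExtractWords + Count compute.
class WordCountSketch {

    // Uses the same tokenization as the ExtractWords step: split on anything
    // that is not a letter or an apostrophe, then drop empty tokens.
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("[^a-zA-Z']+")) {
                if (!word.isEmpty()) {
                    // Equivalent in spirit to counting per element.
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "to be, or not to be",
                "  that is the question");
        System.out.println(countWords(lines));
    }
}
```

Note that the empty-token check matters: a line that starts with a delimiter (like the leading spaces above) makes split() return an empty first element. The matching is also case-sensitive, so "To" and "to" would be counted separately.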
Beam is inspired by FlumeJava, but it has replaced FlumeJava's methods on PCollection with the ubiquitous PTransform. "Where's my PCollection.map()?" looks at the history and design decisions behind this.
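The idea of keeping the collection type small and pushing every operation into a transform can be sketched with a toy model in plain Java. This is only my own illustration of the pattern, not Beam's actual classes: Coll, Transform, mapEach and trimmedLength are hypothetical names, and the data lives in an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Toy model of the "apply a transform" style. All names are hypothetical.
class ApplySketch {

    // Stand-in for a PTransform: a step from one collection to another.
    interface Transform<I, O> {
        Coll<O> expand(Coll<I> input);
    }

    // Stand-in for a PCollection: it has no map() or filter(), only apply().
    static final class Coll<T> {
        final List<T> items;

        Coll(List<T> items) {
            this.items = items;
        }

        <O> Coll<O> apply(Transform<T, O> transform) {
            return transform.expand(this);
        }
    }

    // "map" is provided by a transform factory instead of a collection method.
    static <I, O> Transform<I, O> mapEach(Function<I, O> fn) {
        return input -> {
            List<O> out = new ArrayList<>();
            for (I item : input.items) {
                out.add(fn.apply(item));
            }
            return new Coll<>(out);
        };
    }

    // A composite transform is built from other transforms and is applied
    // in exactly the same way as a primitive one.
    static Transform<String, Integer> trimmedLength() {
        return input -> input
                .apply(mapEach((String s) -> s.trim()))
                .apply(mapEach((String s) -> s.length()));
    }

    public static void main(String[] args) {
        Coll<String> words = new Coll<>(Arrays.asList(" beam ", "flink"));
        Coll<Integer> lengths = words.apply(trimmedLength());
        System.out.println(lengths.items);
    }
}
```

In this toy version, user-defined and composite steps plug in through the same apply() call as built-in ones, and the collection type never needs a new method when a new operation is added.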
Besides Apache Beam, Google Cloud Dataflow offers two advanced features
Reactive Kafka is an Akka Streams connector for Apache Kafka. Check out Reducing Microservice Complexity with Kafka and Reactive Streams for examples.
How to solve hard programming problems? Julia Evans comes up with three ways.
Intel x86s hide another CPU that can take over your machine (you can't audit it). Take it with a grain of salt.
That's all for this week. See you at the meetup.