Weekly Reading 0x6

2016-05-29

I should apologize (to myself, at least) for skipping last week's reading. I gave a talk on Gearpump at the 5th Nanjing Big Data Tech Meetup on Saturday, and I had been busy preparing materials the week before. The only news that caught my eye was "Inside Palantir, Silicon Valley’s Most Secretive Company", about Palantir's growing pains. I was not convinced, since losing customers can happen to any startup. Now please read "The Real Lesson for Data Science That is Demonstrated by Palantir's Struggles". Sold?

Let's get down to this week's readings.

Streaming

  • Apache Kafka 0.10 and Confluent Platform 3.0 have been announced. Highlights of Kafka 0.10:

    • Kafka Streams available
    • Rack awareness, so that replicas are guaranteed to span multiple racks or availability zones
    • Timestamps in messages, indicating when each message was produced
    • A cap on the number of records the Kafka consumer returns from a single poll (max.poll.records)

    More on Kafka Streams. Unlike other distributed streaming engines (e.g. Storm, Spark Streaming), a Kafka Streams instance (program) is simply a Java process that runs on one or more threads.

    Kafka Streams applications can run on YARN, be deployed on Mesos, run in Docker containers, or just embedded into existing Java applications.

  • Twitter just open sourced Heron, the successor to Storm as the real-time stream processing engine at Twitter. It provides backward compatibility with Storm's Topology API. Please go to their website for more information.

  • Microsoft published a new paper, StreamScope: Continuous Reliable Distributed Processing of Big Data Streams, at NSDI '16. After a quick glance, it does not seem to go beyond watermarks.
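
Coming back to the Kafka Streams item above: since an instance is just a Java process with one or more threads, most of its deployment story boils down to a handful of settings. Here is a minimal sketch of those settings using only java.util.Properties. The class name StreamsConfigSketch, the build helper, and the values in main are my own placeholders; the keys application.id, bootstrap.servers and num.stream.threads are Kafka Streams configuration names, and in a real application the resulting Properties would be handed to the Kafka Streams library together with a processing topology.

```java
import java.util.Properties;

// Illustrative only: the class name, helper name and sample values are assumptions.
final class StreamsConfigSketch {

    // Collects the basic settings a Kafka Streams application is started with.
    static Properties build(String applicationId, String bootstrapServers, int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be >= 1");
        }
        Properties props = new Properties();
        // Identifies the application; instances sharing this id divide the work.
        props.setProperty("application.id", applicationId);
        // Kafka brokers to connect to (placeholder address in main below).
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Number of stream threads inside this single JVM process.
        props.setProperty("num.stream.threads", Integer.toString(threads));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("weekly-reading-demo", "localhost:9092", 2);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Scaling out is then a matter of starting more processes with the same application.id, or raising num.stream.threads within one process, whether that process lives on YARN, on Mesos, in a Docker container, or inside an existing Java application.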

Spark

Others

  • Julia Evans shared some advice on writing blog posts.

    I really like writing short blog posts because I have a short attention span and I find short blog posts easier to digest when other people write them.

Yes, blogging should never burden you or your readers. That's it.