Storm Basics

2015-10-30Home

Apache Storm is a distributed stream processing system (streaming system), which provides high performance unbounded data processing. I believe Storm is the first widely used streaming system and is to streaming what Hadoop is to batch. Emerging streaming systems make themselves known by comparing with Storm and claim they have better performance than Storm. I highly recommend you read history of apache storm and lessons learned by its creator Nathan Marz.

Storm is evolving fast. The current stable release is 0.9.5, and 0.9.6 and 0.10.0 will soon be released. 0.10.0 will include such big features as Nimbus HA and Kerberos support. Other nice features like Automatic Back Pressure, Tuple Batching and Resource Aware Scheduling are on the way. I've been following Storm for quite a while, from the storm-benchmark project to more recently storm on gearpump. Each time I went to the implementation and knew a bit more but there are still many blind spots. Meanwhile, I don't like the structure of the official documentation. Hence, it's a good time to look back and summarize (esp. for myself) the basic building blocks of Storm. By the way, thanks to Storm, I learned Clojure and entered into the secret garden of Lisp.

This will be a series of Storm Basics articles based on the 0.9.x branch and leave more recent advances for future. I won't cover Trident either since it's a totally add-up layer built on Storm core and too complex for basics.

Overview

Storm is deeply influenced by MapReduce (Nathan is also the author of Cascalog, the Clojure API over Hadoop, the open source implementation of MapReduce). and brings the MapReduce pattern to the streaming world. Some background knowledge of MapReduce and Streaming here will help us understand the philosophy of Storm.

MapReduce

What is MapReduce ?

The problem is how to process large data sets on commondity machines (please refer to the original paper for some background).

Large data sets don't fit in a single machine so we need a distributed system that is able to

utilize hardware resources
schedule computations across machines
transfer data between computations

Commondity machines are likely to fail and hence the system should provide fault tolerance such as

restart computations on the same or other machines
re-send data to restarted computations

Further, users with little experience for distributed systems and concurrency programing could easily use the system.

MapReduce is such a system. It abstracts processing logic into two simple functions, map and reduce. Users only need to specify the functions locally and the system will automatically parallelize them across machines. It handles all the resource management, computation scheduling, data transfer and fault tolerance for users.

In summary, MapReduce is a system to utilize distributed system to process large data sets.

Streaming

What is Streaming ?

In The world beyond batch: Streaming 101, Tyler Akidau defined streaming as

a type of data processing engine that is designed with infinite data sets in mind.

In contrast, batch systems like MapReduce are designed to process finite data sets usually partitioned in batches and pre-loaded into the system.

He further clarifies that such terms as "Unbounded data", "Unbounded data processing" and "Low-latency, approximate, and/or speculative results" should not be taken as streaming although they are typical characteristics.

repeated runs of batch engines have been used to process unbounded data since batch systems were first conceived (and conversely, well-designed streaming systems are more than capable of handling “batch” workloads over bounded data).

Spark is a batch stream while Spark Streaming is a streaming system built on Spark. Flink supports batch mode with an underlying streaming system.

That said, a batch system has an upper bound on latency it can achieve while a well designed streaming system provides accurate results. Hence, streaming is a superset of batch.

Storm

As a streaming system, Storm is designed to process infinite data sets (messages). It processes each message as soon as its arrival, emits result per message, and goes to next message and loops forever. Storm inherits the MapReduce model that users write local functions and the system handles the parallelization, resource management, fault tolerance, etc. It extends the computation logic from map and reduce to a directed graph called Topology. Storm provides low latency, high throughput and at-least-once message guarantee. These topics will be covered in the following posts.

I'd like to conclude this overview with a quick start guide to try out Storm.

Quick Start

You should have Java, Zookeeper and Storm installed locally.

# 1. launch zookeeper 
bin/zkServer.sh start            

# 2. start nimbus 
bin/storm nimbus

# 3. start supervisor 
bin/storm supervisor

# 4. submit topology
bin/storm jar examples/storm-starter/storm-starter-topologies-${version}.jar storm.starter.ExclamationTopology exclamation
   
# 5. start web ui 
bin/storm ui

Open localhost:8080 on your browser and the topology should already be running.