Apache Storm

Storm definition

Storm_logoApache Storm is a low-latency, high-availability real-time distributed computing system based on master-slave architecture.
Storm is ideal for working with data that need to be analyzed in real time where latency is a variable to take into account, an example of this would be the IoT sensors.

Storm delegates in ZooKeeper the maintenance of the state of their instances.

Features

  • Integrates with queuing systems such as Kestrel, RabbitMQ/AMQP, Kafka, JMS or Amazon Kinesis and database systems.
  • Provides a simple, easy-to-use API.
  • Scalable
  • Fault Tolerant
  • Easy to set up and deploy
  • Programmable in different languages such as: Ruby, Python, Javascript or Perl.
  • Ensures that a tuple is fully processed

Architecture

Storm possesses a master-slave architecture:

  • Maestro: Called Nimbus, is responsible for distributing the code through the cluster and perform the assignment and monitoring of tasks in the different nodes that compose the cluster.
  • Slave: Called Supervisor, is in charge of collecting and processing the works that are assigned to him in the machine where it is executed. The slave nodes cause the work to be distributed throughout the cluster and, in case one of them failed, the master would know thanks to Zookeeper and redirect the tasks to another node.
  • Zookeeper: is responsible for coordinating Nimbus and supervisors, for it stores the status of each one of them at any time.
Storm is made up of three levels of abstraction.
  • Spout  takes care of collecting input data flow
  • Bolt is responsible for processing or transforming the data
  • Topology is similar to a graph. Each node is responsible for processing a certain information and passes the witness to the next node.
Storm Architecture

Storm Architecture

Storm has two operating modes:

  • Local mode is very useful to test the developed code as it runs on a single Java virtual machine, this mode simulates with threads of the different nodes of the cluster.
  • Cluster mode or production runs the topology in cluster, i.e. it distributes and executes our code on the different machines.

Source: Official website