Apache Storm
The ever-increasing growth in the production and analysis of Big Data keeps presenting new challenges, and data scientists and programmers take them in their stride – constantly improving the applications they develop. One such challenge was real-time streaming. Real-time data holds extremely high value for businesses, but it comes with a time window after which it loses that value – an expiry date, if you will. If the value of real-time data is not realised within that window, no usable information can be extracted from it. This data arrives quickly and continuously, hence the term “streaming”.
Analytics on this real-time data can help you stay updated on what’s happening right now, such as the number of people reading your blog post or the number of people visiting your Facebook page. Although it might sound like just a “nice-to-have” feature, in practice it is essential. Imagine you’re part of an ad agency performing real-time analytics on the ad campaigns your client paid heavily for. Real-time analytics can keep you posted on how your ad is performing in the market, how users are responding to it, and other things of that nature. Quite an essential tool if you think of it this way, right?
Looking at the value that real-time data holds, organisations started coming up with various real-time data analytics tools. In this article, we’ll be talking about one of those – Apache Storm. We’ll look at what it is, the architecture of a typical Storm application, its core components (also known as abstractions), and its real-life use cases.
Apache Storm, released by Twitter, is a distributed open-source framework that helps in the real-time processing of data. Apache Storm works for real-time data just as Hadoop works for batch processing of data (batch processing is the opposite of real-time: data is divided into batches, and each batch is processed as a whole rather than as it arrives).
Apache Storm does not have any state-managing capabilities of its own and relies heavily on Apache ZooKeeper (a centralised service for managing the configurations in Big Data applications) to manage its cluster state – things like message acknowledgements, processing statuses, and other such metadata. Storm applications are designed in the form of directed acyclic graphs. Storm is known for processing over one million tuples per second per node, is highly scalable, and provides guarantees on the processing of jobs. It is written in Clojure, a Lisp-like functional programming language.
At the heart of Apache Storm is a Thrift definition for defining and submitting the logic graphs (also known as topologies). Since Thrift can be used from any language of your choice, topologies can be created in any language too. This makes Storm support a multitude of languages – making it all the more developer-friendly.
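To make this concrete, here is a minimal sketch of the multilang pattern popularised by the storm-starter examples: the Java side is a thin ShellBolt wrapper, while the actual processing logic lives in an external script. The script name splitsentence.py is hypothetical – it stands in for any script that speaks Storm’s multilang protocol.

```java
import java.util.Map;

import org.apache.storm.task.ShellBolt;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;

// A bolt whose processing logic lives in an external (here: Python) script.
// Storm spawns the script as a subprocess and exchanges tuples with it over
// stdin/stdout using the multilang protocol (JSON messages).
public class SplitSentence extends ShellBolt implements IRichBolt {

    public SplitSentence() {
        // "splitsentence.py" is a hypothetical script that would split
        // incoming sentences into words and emit them back to Storm.
        super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```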
Storm can run on YARN and integrates well with the Hadoop ecosystem. It is a true real-time data processing framework with zero batch support: it processes each record in the stream as a discrete ‘event’ as it arrives, instead of breaking the stream into a series of small batches. Hence, it is best suited for data that is to be ingested and processed event by event.
Let’s have a look at the general architecture of a Storm application – it’ll give you more insight into how Storm works!
Apache Storm: General Architecture and Important Components
There are essentially two types of nodes involved in any Storm application.
- Master Node (Nimbus Service)
If you’re aware of the inner workings of Hadoop, you must know what a ‘Job Tracker’ is. It’s a daemon that runs on the Master node of Hadoop and is responsible for distributing tasks among nodes. Nimbus is a similar kind of service for Storm. It runs on the Master node of a Storm cluster and is responsible for distributing tasks among the worker nodes.
Nimbus is an Apache Thrift service, which allows you to submit your code in the programming language of your choice. This helps you write your application without having to learn a new language specifically for Storm.
As we discussed earlier, Storm lacks state-managing capabilities of its own. The Nimbus service has to rely on ZooKeeper to monitor the messages being sent by the worker nodes while processing the tasks. All the worker nodes update their task status in the ZooKeeper service for Nimbus to see and monitor.
- Worker Node (Supervisor Service)
These are the nodes responsible for performing the actual tasks. Worker nodes in Storm run a service called Supervisor. The Supervisor is responsible for receiving the work assigned to a machine by the Nimbus service. As the name suggests, it supervises the worker processes and helps them complete the assigned tasks. Each of these worker processes executes a subset of the complete topology (a minimal configuration sketch follows below).
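To tie the two node types together, here is a minimal sketch of what a node’s storm.yaml configuration might look like. The hostnames are placeholders, and the exact keys can vary between Storm versions (older releases use nimbus.host instead of nimbus.seeds):

```yaml
# ZooKeeper ensemble where cluster state lives (placeholder hostnames)
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"

# Machine(s) running the Nimbus service
nimbus.seeds: ["nimbus.example.com"]

# Local scratch directory for Nimbus/Supervisor state
storm.local.dir: "/var/storm"

# Each port is one worker slot: the Supervisor can launch one
# worker process (JVM) per listed port
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
```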
A Storm application has essentially four components/abstractions that are responsible for performing the tasks at hand. These are:
- Topology: The logic for any real-time application is packaged in the form of a topology – which is essentially a network of spouts and bolts (see the sketch after this list). To understand it better, you can compare it to a MapReduce job (read our article on MapReduce if you’re unaware of what that is!). One key difference is that a MapReduce job finishes when its execution is complete, whereas a Storm topology runs forever (unless you explicitly kill it yourself). The network consists of nodes that form the processing logic, and links (also known as streams) that represent the passing of data between processing steps.
- Stream: You need to understand what tuples are before understanding what streams are. Tuples are the main data structure in a Storm cluster. They are named lists of values, where the values can be anything from integers, longs, shorts, bytes, doubles, strings, booleans, and floats, to byte arrays. Streams, then, are sequences of tuples that are created and processed in real-time in a distributed environment. They form the core abstraction unit of a Storm cluster.
- Spout: A spout is the source of streams in a Storm topology. It is responsible for getting in touch with the actual data source, receiving data continuously, transforming that data into an actual stream of tuples, and finally sending them to the bolts to be processed. A spout can be either reliable or unreliable: a reliable spout will replay a tuple if Storm fails to process it, whereas an unreliable spout forgets about the tuple as soon as it has been emitted.
- Bolt: Bolts are responsible for performing all the processing of the topology. They form the processing logic unit of a Storm application. You can use bolts to perform many essential operations, like filtering, functions, joins, aggregations, connecting to databases, and many more.
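Here is a minimal, self-contained sketch that puts these four abstractions together, using Storm’s Java API (1.x-style imports under org.apache.storm). The names – SentenceSpout, SplitBolt, the field names, and the topology id word-split – are illustrative, not anything prescribed by Storm:

```java
import java.util.Map;
import java.util.Random;
import java.util.UUID;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordSplitTopology {

    // Spout: the source of the stream. This one just emits random sentences;
    // a real spout would pull from Kafka, a queue, an API, and so on.
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private Random random;
        private static final String[] SENTENCES = {
            "storm processes streams of tuples",
            "a topology runs until it is killed"
        };

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            this.random = new Random();
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            String sentence = SENTENCES[random.nextInt(SENTENCES.length)];
            // Emitting with a message ID makes this spout "reliable": Storm
            // will call ack()/fail() below, and a failed tuple can be replayed.
            collector.emit(new Values(sentence), UUID.randomUUID().toString());
        }

        @Override
        public void ack(Object msgId) { /* tuple fully processed downstream */ }

        @Override
        public void fail(Object msgId) { /* a real spout would re-emit here */ }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Tuples are named lists of values; this stream has one field.
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: the processing logic. This one splits each sentence into words.
    public static class SplitBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            for (String word : tuple.getStringByField("sentence").split(" ")) {
                collector.emit(new Values(word));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        // Topology: wires the spout and bolt into a directed acyclic graph.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");

        Config conf = new Config();
        conf.setDebug(true);

        // Local mode for testing; on a real cluster you would submit via
        // StormSubmitter.submitTopology(...) instead.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-split", conf, builder.createTopology());
        Utils.sleep(10000);
        cluster.killTopology("word-split");
        cluster.shutdown();
    }
}
```

Note how the topology would run indefinitely if we didn’t kill it explicitly – exactly the difference from a MapReduce job described above.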
Who Uses Storm?
Although a number of powerful and easy-to-use tools have their presence in the Big Data market, Storm finds a unique place in that list because of its ability to handle any programming language you throw at it. Many organisations put Storm to use (find a comprehensive list here).
Let’s look at a couple of big players that use Apache Storm and how!
- Twitter
Twitter uses Storm to power a variety of its systems – from the personalisation of your feed and revenue optimisation to improving search results and other such processes. Because Storm was developed at Twitter (and later donated to the Apache Software Foundation, where it became Apache Storm), it integrates seamlessly with the rest of Twitter’s infrastructure – the database systems (Cassandra, Memcached, etc.), the cluster infrastructure (Mesos), and the monitoring systems.
- Spotify
Spotify is known for streaming music to over 50 million active users and 10 million subscribers. It provides a wide range of real-time features like music recommendation, monitoring, analytics, ads targeting, and playlist creation. To achieve this feat, Spotify utilises Apache Storm.
Stacked with Kafka, Memcached, and a netty-zmtp-based messaging environment, Apache Storm enables Spotify to build low-latency, fault-tolerant distributed systems with ease.