Here’s a discussion of five Big Data processing frameworks: Hadoop, Spark, Flink, Storm, and Samza. It gives an overview of each, offers comparative insights, and links to external resources on related topics.
With the modern world’s unrelenting deluge of data, settling on the exact sizes that make data “big” is somewhat futile, with practical processing needs trumping the imposition of theoretical bounds. Like the term Artificial Intelligence, Big Data is a moving target; just as the expectations of AI of decades ago have largely been met and are no longer referred to as AI, today’s Big Data is tomorrow’s “that’s cute,” owing to the exponential growth in the data that we, as a society, are creating, keeping, and wanting to process. As such, traditional data processing tools that do not scale to Big Data will eventually become obsolete.
So the question is, what are we doing with this data? The answer, of course, is very context-dependent. But it seems everyone is processing Big Data, and it turns out that this processing can be abstracted to a degree that lets general-purpose Big Data processing frameworks handle it. A few of these frameworks are very well-known (Hadoop and Spark, I’m looking at you!), while others are more niche in their usage but have still managed to carve out respectable market shares and reputations.
We will take a look at five of the top open source Big Data processing frameworks being used today. Of course, these aren’t the only ones in use, but hopefully they form a small representative sample of what is available, and the overviews give a brief sense of what can be accomplished with the selected tools.