Here’s a little quiz for you. What is YARN? A) spun thread used for knitting, weaving, or sewing, B) a long or rambling story, especially one that is implausible, or C) a cluster management technology that is one of the key features in the second-generation Hadoop 2 data processing system. If you guessed letter C then you are correct!
For those who need a quick review, Hadoop is a Big Data storage and processing framework that takes in large amounts of data and breaks it into smaller, manageable chunks spread across many servers. The technology is complex, but at a high level the Hadoop ecosystem takes a “divide and conquer” approach to processing Big Data instead of processing data in tables, as in a relational database like Oracle or MySQL.
MapReduce is the key to understanding Hadoop’s parallel processing functionality: it enables data in various formats (XML, text, binary, log, SQL, etc.) to be divided up, mapped out to many computer nodes, and then recombined to produce a final data set. The framework is organized around a “map” function, which transforms a piece of data into some number of key/value pairs. These pairs are then sorted by key so that all pairs with the same key are sent to the same node, where a “reduce” function merges the values for that key into a single result.
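To make the map, sort, and reduce steps concrete, here is a minimal single-machine sketch in plain Python that counts words. It is only an illustration of the idea, not Hadoop's actual API: the function names (map_words, shuffle, reduce_counts, word_count) are made up for this example, and a real cluster would run the map and reduce steps on many nodes in parallel.

```python
from collections import defaultdict


def map_words(line):
    """Map step: turn one line of text into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Sort/shuffle step: group values by key.

    On a real cluster, all pairs sharing a key would be routed to the
    same node; here a dictionary stands in for that routing.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(key, values):
    """Reduce step: merge all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    mapped = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

Each line is mapped independently, which is what lets the work be divided across nodes; only the grouping step needs to bring matching keys together.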
What YARN (short for “Yet Another Resource Negotiator”) really represents is an updated way for the Hadoop ecosystem to manage its cluster resources; it grew out of an effort to re-architect Hadoop and overcome some of the limitations observed in the original version. By taking the resource management capabilities that were in MapReduce and packaging them separately, YARN lets them be used by new engines and extended to other applications; this reconfiguration results in a number of direct benefits.
Here are five key takeaways to know about YARN:
The size of data is growing exponentially. The upper limit of Hadoop/MapReduce 1 was around 4,000 machines per cluster. Because YARN’s ResourceManager focuses exclusively on scheduling, YARN can manage larger clusters much more efficiently, including clusters of 10,000 machines and beyond.
YARN ensures that customers’ existing MapReduce applications continue to run unchanged, without any disruption.
The improved architecture gets rid of the fixed-type, inefficient map and reduce slots in Hadoop MapReduce and replaces them with a generic resource container model, in which container allocation is based on application-defined resource requests (locality, memory, CPU, I/O bandwidth, etc.).
The newly configured Hadoop 2 environment allows multiple programming paradigms (e.g., MapReduce, MPI, Master-Worker) to be used for data processing, supporting workloads such as graph processing and iterative modeling. These added frameworks allow enterprises to realize near real-time processing and increased ROI on their Hadoop investments.
An added advantage of the upgraded architecture is that MapReduce essentially becomes a user-land library. This means that end users can run different versions of MapReduce concurrently on the same cluster and upgrade their applications to newer MapReduce versions on their own schedule. This flexibility adds a layer of agility and allows for innovation without affecting the stability of the software.
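The generic resource container model described above, where an application asks for memory, CPU, and a preferred location rather than a fixed map or reduce slot, can be sketched as a toy allocator. This is a simplified illustration under stated assumptions, not YARN's real scheduler or API: the Node and ResourceRequest classes and the allocate function are hypothetical names invented for this example, and only memory, CPU cores, and a locality hint are modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A hypothetical cluster machine with some free capacity."""
    name: str
    free_memory_mb: int
    free_vcores: int


@dataclass
class ResourceRequest:
    """What an application asks for: resources plus an optional locality hint."""
    memory_mb: int
    vcores: int
    preferred_node: Optional[str] = None


def allocate(request: ResourceRequest, nodes: List[Node]) -> Optional[str]:
    """Grant a container on the first node that fits, or return None.

    The preferred node is tried first; other nodes keep their original
    order because Python's sort is stable.
    """
    ordered = sorted(nodes, key=lambda node: node.name != request.preferred_node)
    for node in ordered:
        fits_memory = node.free_memory_mb >= request.memory_mb
        fits_cpu = node.free_vcores >= request.vcores
        if fits_memory and fits_cpu:
            # Reserve the resources so later requests see reduced capacity.
            node.free_memory_mb -= request.memory_mb
            node.free_vcores -= request.vcores
            return node.name
    return None


if __name__ == "__main__":
    cluster = [Node("node-a", 4096, 4), Node("node-b", 8192, 8)]
    print(allocate(ResourceRequest(2048, 2, preferred_node="node-b"), cluster))
    print(allocate(ResourceRequest(6000, 1, preferred_node="node-a"), cluster))
```

Because a container is sized by the request rather than by a predefined slot type, the same capacity can serve a map task, a reduce task, or a non-MapReduce workload.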
In the last several years Hadoop has scaled up to become the most popular Big Data processing ecosystem on the market today. And while small businesses might be tempted to overlook Hadoop because of its complexity, it’s good to stay in the loop on some of the latest improvements and innovations in the architecture. As one writer has summarized it well, YARN “is a bit hard to describe because like Hadoop it isn’t all one thing. At its base YARN is a cluster manager for Hadoop. . . . Yet part of the motivation for YARN is that the Hadoop ecosystem expands beyond algorithms and technologies based on MapReduce.”
Today’s data sizes will soon seem minuscule compared to what’s next, especially considering the massive growth expected in the Internet of Things market. The biggest takeaway here is that while Big Data is getting much bigger, YARN represents a next-generation solution for managing these increasing volumes. If you haven’t done so yet, please consider building a strategy to get on board with the Hadoop/YARN ecosystem and start exploring relevant use cases that will help your organization keep up with its Big Data.