I’ve just finished one of the best books I’ve found on the subject of MapReduce. If you’ve perused the market for good references and how-to guides on Hadoop and MapReduce, you know that the good ones are few and far between. This one is worthy of your time.
In the tradition of seminal works like Design Patterns: Elements of Reusable Object-Oriented Software, Don Miner of EMC Greenplum and Adam Shook of ClearEdge IT Solutions have combined the collective wisdom of the entire MapReduce community into a book that is part cookbook and part lesson book. It is written for a broad audience. If you are new to MapReduce and want to learn how to write effective MapReduce code, or if you are an experienced developer and want a handy reference, you will find this book valuable. It also has another audience: people who never intend to write a line of MapReduce code but want to understand what Hadoop and MapReduce can and cannot do.
That’s important, because there are a lot of people out there who have been hearing the Hadoop hype but do not understand its limitations. The MapReduce paradigm can be very powerful for the right type of problem, but the classes of problems it can solve are finite. Miner and Shook lay out the different classes of problems that are suitable for solution in MapReduce, and then describe the various design patterns that can be used to implement those solutions. The examples are in Java MapReduce, but the code examples are less important than the design concepts presented. The book is easily consumable by people with a basic technical background.
The authors describe a MapReduce design pattern as “a template for solving a common and general data manipulation problem with MapReduce.” They go on to say that “using design patterns is all about using tried and true design principles to build better software.” This book is a collection of design patterns for solving problems in classes including summarization, binning, joining, and more.
As an example, let’s look at the partitioning pattern. The authors follow a common structure in describing each pattern, starting with a description of what the design pattern is intended to accomplish. For partitioning, the idea is to read a data set and group similar records together into smaller data sets. The motivation for the pattern is described next, giving insight into where you might encounter a need to partition data in the real world.
Next, the authors describe the necessary conditions for the pattern to be applicable, calling it – predictably – applicability. In the partitioning case, one of the conditions is that you must know ahead of time how many output partitions you need.
Then comes the cool part: the structure of the pattern laid out in a diagram and accompanied by a lucid description of each component and how it works to solve the problem. Here is the structure of the partitioning pattern:
Next, the authors describe the consequences (results) of the pattern, in this case a set of files containing the various partitions. Known uses of the pattern are listed, such as sharding or partitioning by a category. This is followed by a very useful section showing the resemblances the pattern has to similar patterns or paradigms in other languages, such as partitioned tables in a relational database.
Performance is always at the top of my mind when designing any solution, and so I really like the performance analysis section in which the authors describe the performance concerns with the implementation of the pattern, as well as give pointers on how to handle these challenges. This is knowledge born of experience, and it is worth more than the price of the book by itself. I’ve spent many a dark and caffeine-driven night earning the kind of experience that is laid out clearly in these pages.
And last but certainly not least, examples detail how to code the pattern. Examples are given in Java, and the explanations are informative and easy to understand. The partitioning example partitions a data set by date, and the authors provide driver code, mapper code, partitioner code, and reducer code along with explanations for what the code is doing and why.
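To give a feel for what partitioning by date involves, here is a minimal sketch in plain Java. It is not the book's code and does not use the Hadoop API. The record format (an ISO date, a comma, then a payload), the year range, and all class and method names are assumptions made for illustration. In a real Hadoop job, the `getPartition` logic would live in a custom subclass of `org.apache.hadoop.mapreduce.Partitioner`, with the mapper emitting the year as the key.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class DatePartitionSketch {

    // Hypothetical range: one partition per year. The number of partitions
    // must be known up front, as the pattern's applicability section requires.
    static final int MIN_YEAR = 2009;
    static final int MAX_YEAR = 2012;
    static final int NUM_PARTITIONS = MAX_YEAR - MIN_YEAR + 1;

    // "Map" step: pull out the year to use as the partitioning key.
    // Assumes each record looks like "yyyy-MM-dd,payload".
    static int extractYear(String record) {
        String datePart = record.substring(0, record.indexOf(','));
        return LocalDate.parse(datePart).getYear();
    }

    // "Partition" step: turn the key into a partition index in [0, numPartitions).
    static int getPartition(int year, int numPartitions) {
        if (year < MIN_YEAR || year > MAX_YEAR) {
            throw new IllegalArgumentException("Year out of range: " + year);
        }
        return (year - MIN_YEAR) % numPartitions;
    }

    // "Reduce" step stand-in: collect each partition's records into its own list,
    // where a real job would write each partition to its own output file.
    static List<List<String>> partition(List<String> records) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < NUM_PARTITIONS; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String record : records) {
            int index = getPartition(extractYear(record), NUM_PARTITIONS);
            partitions.get(index).add(record);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> records = List.of(
                "2009-01-05,first login",
                "2012-12-22,book released",
                "2009-07-30,second login",
                "2011-03-14,profile updated");
        List<List<String>> partitions = partition(records);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println((MIN_YEAR + i) + ": " + partitions.get(i));
        }
    }
}
```

The point of the sketch is the shape of the pattern rather than the details: the key is derived from the record, the partition index is a pure function of that key, and records that share a key always land in the same output.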
For what more could you ask?
Miner and Shook have a passion for their subject and are also deeply knowledgeable, and it shines through in their writing. In MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems, they have published a work that serves both as a great way to learn MapReduce and as a valuable reference that I expect will be dust-free on your shelf for quite some time. It will certainly be on mine.
MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems is due to be released on December 22, 2012.