Introduction to Hadoop

Hadoop is an open-source implementation of Google's MapReduce, including a distributed file system. By reading this document you will learn about the MapReduce paradigm.

What is Hadoop?

Hadoop is a sub-project of Lucene (a collection of industrial-strength search tools), under the umbrella of the Apache Software Foundation. Hadoop parallelizes data processing across many nodes (computers) in a compute cluster, speeding up large computations and hiding I/O latency through increased concurrency. Hadoop is especially well-suited to large data processing tasks (like searching and indexing) because it can leverage its distributed file system to cheaply and reliably replicate chunks of data to nodes in the cluster, making data available locally on the machine that is processing it.

Hadoop is written in Java. Hadoop programs can be written using a small API in Java or Python. Hadoop can also run binaries and shell scripts on nodes in the cluster provided that they conform to a particular convention for string input/output.
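
For instance, here is a minimal sketch of what such a script might look like in Python for a word count, assuming the common streaming convention of reading plain text lines from standard input and writing tab-separated <key, value> strings to standard output (the exact format expected by your cluster may differ):

import sys

# Emit one tab-separated <word, 1> pair per word read from standard input.
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")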

Hadoop provides the application programmer with the abstractions of map and reduce (which may be familiar to those with functional programming experience). Map and reduce are available in many languages, such as Lisp and Python.

Data Structures

Hadoop operates on <key, value> pairs, or two-tuples. The value can be any data type, but it must be serializable, since Hadoop transmits data over the wire as strings. All data that your Hadoop programs produce must be formatted as pairs. Word counts are one example of output formatted as pairs:

a      37
the    20
and    16
zygote 1

Map and Reduce

The MapReduce paradigm takes inspiration from the map and reduce programming constructs common in many programming languages. We introduce map and reduce using Python, then draw out differences between map/reduce in most programming languages and the map/reduce phases of Hadoop.

Map

Map applies a function to each element in a list, returning a list of the results. For example, here is code in Python which uses map to triple each element in a list:

def triple(n):
    return n * 3

print(list(map(triple, [1, 2, 3, 4])))

This code prints the following list:

[3, 6, 9, 12]

Reduce

Reduce also applies a function to the elements of a list, but instead of being applied to each element separately, the function is applied to two arguments: the "current result" and the next element in the list. The current result is initialized by applying the function to the first two elements of the list. This lets you build a single result (which can be another list, but is often a scalar value) from a list. This is best illustrated with another simple Python example:

from functools import reduce

def add(n1, n2):
    return n1 + n2

print(reduce(add, [1, 2, 3, 4]))

Can you guess what this will print? Try it if you are unsure. You can think of this call as making three nested function calls, like this:

add(add(add(1, 2), 3), 4)

Parallelizing Map and Reduce with Hadoop

In the MapReduce programming abstraction, the map function takes a single <key, value> pair, and produces zero or more <key, value> pairs as a result. This differs slightly from the functional programming construct map which always produces one and only one result for each invocation of map. The MapReduce style of map allows you to produce many intermediate pairs which can then be further analyzed with reduce.
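
As a concrete (and purely illustrative) sketch, a word-count map function in this style might take one <document_name, document_text> pair and emit zero or more <word, 1> pairs. The function name and the use of a Python generator are assumptions for illustration, not a particular Hadoop API:

def wordcount_mapper(key, value):
    # key: the document name; value: the document's text.
    # One input pair can produce zero or more output pairs.
    for word in value.split():
        yield (word, 1)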

In MapReduce, the reducer is also more flexible than its functional programming counterpart. While reduce is similar in spirit to the reduce described in the Python example above, it is not limited to processing the list of pairs two at a time; rather, it is given an iterator over all pairs that have the same key, and that sequence can be walked over in any way the programmer chooses. Also like the MapReduce map, the MapReduce reduce can emit an arbitrary number of pairs, although applications often want to reduce to a single output pair.
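
A matching reducer sketch (again with illustrative names, not a real Hadoop API) receives one key along with an iterator over all of the values emitted for that key, and may emit any number of pairs; here it emits exactly one:

def wordcount_reducer(key, values):
    # key: a word; values: an iterator over all counts emitted for that word.
    total = 0
    for count in values:
        total += count
    yield (key, total)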

A Quick Illustration of Reduce

The reduce phase in MapReduce joins together in some way those pairs which have the same key. For example, say that you do a word count on N documents, and each document is on a separate node. Your map function will process each document separately, producing many pairs of the following form: <word, count>. The documents will most likely have many words in common, so those counts will need to be combined. The MapReduce algorithm automatically starts a reduce process for each set of pairs with the same key (in this case, counts for the same word), and the reducer can simply sum those counts together, producing a single pair of the form <word, total_count>. For example, reduce might transform one set of pairs into another like this:

a      37
the    20                      a      50
a      10                      and    16
a      3      -- reduce -->    the    32
and    16                      zygote 1
zygote 1
the    12

A word count example written in Java is distributed with Hadoop in src/examples/org/apache/hadoop/examples/WordCount.java.
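
To see how the pieces fit together, the following small pure-Python simulation runs the hypothetical wordcount_mapper and wordcount_reducer sketched above over two tiny in-memory documents. In real Hadoop, the grouping of pairs by key (the "shuffle") happens across the cluster rather than in a local dictionary:

from collections import defaultdict

documents = {"doc1": "a the a and a", "doc2": "the zygote the"}

# Map phase: process each document independently.
grouped = defaultdict(list)
for name, text in documents.items():
    for word, count in wordcount_mapper(name, text):
        grouped[word].append(count)   # group intermediate pairs by key

# Reduce phase: run the reducer once per distinct key.
for word, counts in grouped.items():
    for key, total in wordcount_reducer(word, iter(counts)):
        print(key, total)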

By separating the map task (where the computation on each input element is independent of the others) from the reduce task (where pairs with the same key must be processed together on the same node), MapReduce can improve parallelism: many map tasks can run concurrently across the cluster, and each reduce task only needs the intermediate pairs for its keys. This is why map and reduce are kept as two separate phases.

Further Reading

What's Next?

Set up Hadoop to run on a single machine (such as your laptop or home computer).