These instructions show you how to run Hadoop on a single machine, simulating a cluster by running multiple Java VMs. This setup is well suited to developing and testing Hadoop applications. The Hadoop website has an excellent tutorial on installing and setting up Hadoop on a single node; this document supplements that tutorial with some tips and gotchas. You will also write a small sample program that uses Hadoop to fetch titles from web pages.
Java 1.5.x is required. You may already have it installed; type java -version at the command line to check. It should report version "1.5.x" (e.g., 1.5.0_14).
echo "JAVA_HOME=/usr/java/java1.5" >>~/.bashrc \ && echo "PATH=/usr/java/java1.5/bin:\$PATH" >>~/.bashrcThen log out and back in (or just close and reopen the terminal).
Once Java is installed, go to the command line and type the following command:
java -version

You should see output similar to this:
java version "1.5.0_13" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-241) Java HotSpot(TM) Client VM (build 1.5.0_13-121, mixed mode, sharing)If you see something to the effect of "command not found" then Java may not be on your PATH. If you see a version other than "1.5.x" then you do not have the correct version of Java installed.
If you have any trouble with Java installation or otherwise getting your computer set up for use with Hadoop, please ask the TA for help as soon as possible.
mkdir -p ~/proj/cs147a
mv hadoop-0.15.2.tar.gz ~/proj/cs147a
cd ~/proj/cs147a
tar xvfz hadoop-0.15.2.tar.gz

On Windows you will have to enter these commands from a Cygwin shell.
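As a quick sanity check, list the unpacked directory. The exact contents vary slightly between releases, but you should see at least the bin and conf directories and the examples jar used later in this document.

cd hadoop-0.15.2
ls
# bin/  conf/  lib/  hadoop-0.15.2-examples.jar  ...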
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>fs.default.name</name> <value>localhost:YOURPORT1</value> </property> <property> <name>mapred.job.tracker</name> <value>localhost:YOURPORT2</value> </property> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>dfs.secondary.info.port</name> <value>YOURPORT3</value> </property> <property> <name>dfs.info.port</name> <value>YOURPORT4</value> </property> <property> <name>mapred.job.tracker.info.port</name> <value>YOURPORT5</value> </property> <property> <name>tasktracker.http.port</name> <value>YOURPORT6</value> </property> </configuration>
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!        @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.

This will happen if you try to run the single-node cluster on two separate machines in the Berry patch (e.g., you work on Pentheus on Tuesday, then work on Gordius on Wednesday), because the name "localhost" points to different machines depending on which machine you are logged into. A solution to this problem can be found in the troubleshooting guide.
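One common workaround, which may or may not be the one described in the troubleshooting guide, is to delete the stale localhost entry from your known_hosts file and accept the new host key on the next connection:

ssh-keygen -R localhost     # remove the old key for localhost from ~/.ssh/known_hosts
ssh localhost               # reconnect and accept the new host key when prompted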
$ bin/hadoop jar hadoop-*-examples.jar grep input output2 'dfs[a-z.]+'

Or else you can remove the output directory and then re-run the original command:
$ bin/hadoop dfs -rmr output
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
You can see the commands that HDFS allows by typing this at the command line:
bin/hadoop dfs

A non-exhaustive list of important commands to remember is shown below.
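These are standard HDFS shell subcommands; the angle-bracket arguments are placeholders for paths of your choosing.

bin/hadoop dfs -ls <path>            # list files in HDFS
bin/hadoop dfs -put <local> <hdfs>   # copy a local file into HDFS
bin/hadoop dfs -get <hdfs> <local>   # copy a file from HDFS to the local disk
bin/hadoop dfs -cat <hdfs>           # print the contents of a file in HDFS
bin/hadoop dfs -rmr <path>           # recursively delete a directory in HDFS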
When you are done working with Hadoop, you should always shut it down. To do this, execute this command:
bin/stop-all.sh
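To confirm that everything actually shut down, you can list the running Java processes with jps, which ships with the JDK. If shutdown succeeded, the Hadoop daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker) should no longer appear in the output. (The daemon names listed here are the ones a single-node setup normally runs; yours may differ slightly.)

jps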
Write and test a small Hadoop program.
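Below is a minimal sketch of what such a program might look like against the old org.apache.hadoop.mapred API that ships with Hadoop 0.15 (later versions add generics to the Mapper interface, so the exact signature differs). The class names, the one-URL-per-line input format, and the crude title extraction are all assumptions made for illustration; treat it as a starting point rather than a finished solution.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical example: reads a text file with one URL per line and emits (url, page title).
public class TitleFetcher {

  public static class TitleMapper extends MapReduceBase implements Mapper {
    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
      String url = value.toString().trim();
      if (url.length() == 0) return;

      String title = "";
      try {
        // Fetch the page. A real program would set timeouts and handle redirects.
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new URL(url).openStream()));
        StringBuilder page = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
          page.append(line).append('\n');
        }
        in.close();

        // Very naive title extraction; an HTML parser would be more robust.
        String html = page.toString();
        int start = html.toLowerCase().indexOf("<title>");
        int end = html.toLowerCase().indexOf("</title>");
        if (start >= 0 && end > start) {
          title = html.substring(start + "<title>".length(), end).trim();
        }
      } catch (IOException e) {
        reporter.setStatus("failed to fetch " + url);
      }
      output.collect(new Text(url), new Text(title));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(TitleFetcher.class);
    conf.setJobName("titlefetcher");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(TitleMapper.class);
    // No reducer is set, so the default identity reducer passes the pairs straight through.
    conf.setInputPath(new Path(args[0]));   // HDFS directory containing the URL list
    conf.setOutputPath(new Path(args[1]));  // HDFS directory for the results
    JobClient.runJob(conf);
  }
}

Compile it against the Hadoop core jar (something like javac -classpath hadoop-0.15.2-core.jar TitleFetcher.java), package the resulting classes into a jar, and run it with bin/hadoop jar, just as you ran the grep example above.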