Hadoop and Installation
Q8. I tried installing a Hadoop single-node cluster, but I am lost.
If you are more of a visual person, here is a YouTube video that goes through the installation in detailed, sometimes painstakingly detailed, steps.
I would, however, recommend the VirtualBox route. Download VirtualBox and then download the QuickStart VM image from Cloudera. It is as simple as download and double-click. Here is the link: http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html
Q7. I have Windows. How can I install Hadoop?
I would strongly advise against using Windows as the development platform. The easiest solution is to use VirtualBox with Cloudera's QuickStart VM. Download it here.
If you insist on doing it on Windows, and losing some hair in the process like I did, here you go: https://wiki.apache.org/hadoop/Hadoop2OnWindows
Assignment 0 and Basics
Q6. I noticed that there are two submissions for the big data parser assignment. Which one is the right one?
You may submit to either one of those.
Q5. "Not Enough Disk Space left on Device"?
Since you need to submit output for only the last 500 records, do not save the complete file. Part of learning to handle huge datasets is being selective. Here it is simple: just save the last 500. In real applications, you would select these records based on some criteria.
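One way to keep only the last 500 records without ever storing the whole file is a fixed-size buffer. The following is a minimal Python sketch, not the required solution. It assumes that records in all.txt are separated by blank lines; that separator and the function names iter_records and last_n_records are illustrative assumptions, so adjust them to the actual file layout.

```python
from collections import deque
from typing import Iterable, Iterator, List


def iter_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into records, ASSUMING a blank line ends each record."""
    record: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            record.append(line)
        elif record:
            # Blank line after some content: the current record is complete.
            yield record
            record = []
    if record:
        # The file may not end with a blank line.
        yield record


def last_n_records(lines: Iterable[str], n: int = 500) -> List[List[str]]:
    """Return the last n records while holding at most n of them in memory."""
    # A deque with maxlen discards the oldest record whenever a new one arrives.
    return list(deque(iter_records(lines), maxlen=n))
```

To use it, open the file and pass the file object directly, for example last_n_records(open("all.txt", errors="replace")). The file is read once, line by line, and nothing but the final 500 records needs to be written to disk.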
Q4. What are head, ssh, ...? I do not have much experience with Linux.
Linux (or Unix) is a simple and powerful OS. There are plenty of resources on the web for learning basic Unix commands. If you only want to know about a specific command, its man page is helpful: type man followed by the command name, for example man head. If you want a crash course on Unix terminal commands, this is a very good resource: http://www.ee.surrey.ac.uk/Teaching/Unix/
Q3. I do not have a CS account. What do I do?
Since you do not have an account, you can access all.txt using this link: http://www.cs.brandeis.edu//~cs129a/all.txt Remember, this is a HUGE file. Do not just open it in a browser. If you have to, download it instead. If the download is slow, you may use this link: http://snap.stanford.edu/data/amazon/all.txt.gz and then use gunzip to extract it.
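If disk space is tight, extracting the archive is not strictly necessary, because a gzip file can be read line by line while still compressed. The following is a minimal Python sketch using the standard gzip module. The function name iter_gz_lines is hypothetical, and the choice of UTF-8 with replacement of undecodable bytes is an assumption about the file's encoding.

```python
import gzip
from typing import Iterator


def iter_gz_lines(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded lines from a gzip file without extracting it to disk."""
    # Text mode ("rt") decompresses and decodes on the fly, one chunk at a time.
    with gzip.open(path, mode="rt", encoding=encoding, errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

For example, iterating over iter_gz_lines("all.txt.gz") processes the downloaded archive directly, so the uncompressed file never has to exist on your machine.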
Q2. Which version of Hadoop do I install?
Please install Hadoop 2.3.0. Although Assignment 0 does not require Hadoop, subsequent assignments do. You DO NOT have to install Hadoop to do your assignments; we will provide infrastructure that already has Hadoop installed. However, if your laptop/desktop is powerful enough and you have around 40 GB of free space, we encourage you to install Hadoop. Detailed steps will be provided in class and in the upcoming tutorial session.
Q1. Where is this file "all.txt"? I do not understand the given location.
That is the location on the CS Department NFS (file server). I hope you are a CS student and can log in to any of these machines: http://www.cs.brandeis.edu/~guru/public_work_stations.html You could either "ssh" to these machines from your laptop or use them from Vertica. If you do not have a CS department account and/or cannot log in to those machines, then let me know. In that case I will have to provide something else for you. With big data, you are expected not to keep your own copy of the data but to access it from the available server.