COSI 129a: Introduction to Big Data Analysis

Introduction

The amount of data produced across the globe has been increasing and will continue to grow at an accelerating rate for the foreseeable future. At companies across all industries, servers are overflowing with usage logs, message streams, transaction records, sensor data, business operations records and mobile device data. Effectively analyzing these huge collections of data, commonly known as “big data”, can create significant value, which in turn creates strong demand for experts with the ability to carry out such analysis.
Effective big data analysis requires skills in a range of computer science areas, such as data storage and processing, statistical data analysis and computational linguistics, as well as the skill to combine this knowledge in novel ways. COSI 129a will allow students to combine principles from multiple domains (natural language processing, machine learning, distributed systems design, parallel programming) to analyze large volumes of unstructured data.

Course Contents

COSI 129a will convey knowledge of the principles and practices underlying the state of the art in Big Data Analysis. Initially, we will introduce big data challenges in the domain of computational linguistics as well as the fundamentals of natural language processing. The course will then review available frameworks (MapReduce/Hadoop) for large-scale data collection, storage and processing, including recent advanced optimizations. It will also investigate scalable statistical machine learning techniques (e.g., clustering, classification, regression) as well as existing scalable machine learning tools (e.g., Mahout). Finally, the course will address how these mechanisms and technologies fit together to tackle natural language processing tasks on massive-scale data sets.

Learning Goals

The learning objectives of this class are to:
[Systems] Introduce the state of the art in scalable data management and processing. We will focus particularly on the MapReduce framework and its Hadoop implementation.
[ML] Study, practice, and implement statistical models and machine learning algorithms for the purpose of Big Data analytics.
[CL] Introduce computational linguistic techniques and tools for tackling natural language processing problems.
[Hands-on] Provide students with hands-on experience in analyzing large volumes of unstructured data.

Covered Topics

Audience

The course is intended for upper-level undergraduate students as well as graduate students who have a solid background in programming and computer systems organization. Students are required to have taken COSI 12b (Advanced Programming Techniques) or its equivalent.

Prerequisites

COSI 12b or the equivalent

Required Reading

There is no required textbook for the course. The course will rely mostly on published papers and online resources. The instructors will also make available lecture notes/slides on the topics covered in class. Examples of published articles to be covered include the following:
  • MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean, Sanjay Ghemawat, OSDI, 2004

Links

Grading

Instructors

TAs

Office hours

Name                 Time              Location
James Pustejovsky    Tu 2-3            Volen
Marc Verhagen        Fr 2-3            Feldberg 106-129
Olga Papaemmanouil   Tu, Fr 11-12:30   Volen
Pengyu Hong          TBA               Volen
Liuba Shira          TBA               Volen
Tuan Do              Tu 5-7            Volen 111
Will Burstein        Mo, We 6-7        Vertica
Jessica Lowell       We 1-2, Fr 3-4    Volen 111
Long Sha             Tu, We 10-11      Vertica

Tentative Lecture Schedule