Introduction
The amount of data produced across the globe has been increasing and will continue to grow at an accelerating rate for the foreseeable future. At companies across all industries, servers are overflowing with usage logs, message streams, transaction records, sensor data, business operations records, and mobile device data. Effectively analyzing these huge collections of data, commonly known as “big data”, can create significant value, which in turn creates strong demand for experts able to carry out such analysis.
Effective big data analysis requires skills in a range of computer science areas, such as data storage and processing, statistical data analysis, and computational linguistics, as well as the ability to combine this knowledge in novel ways. COSI 129a will allow students to combine principles from multiple domains (natural language processing, machine learning, distributed systems design, parallel programming) to analyze large volumes of unstructured data.
Course Contents
COSI 129a will convey knowledge of the principles and practices underlying the state of the art in big data analysis. Initially, we will introduce big data challenges in the domain of computational linguistics as well as the fundamentals of natural language processing. The course will then review available frameworks (MapReduce/Hadoop) for large-scale data collection, storage, and processing, including recent advanced optimizations. It will also investigate scalable statistical machine learning techniques (e.g., clustering, classification, regression) as well as existing scalable machine learning tools (e.g., Mahout). Finally, the course will address how these mechanisms and technologies fit together to tackle natural language processing tasks on massive-scale data sets.
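To give a flavor of the MapReduce programming model mentioned above, here is a minimal single-process sketch of word counting, a typical first NLP task on large text collections. It is not Hadoop code and not necessarily what the course assignments look like; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the whitespace tokenization are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every token in one document."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    for token in document.lower().split():
        yield token, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a list of strings."""
    pairs = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    corpus = ["big data needs big tools", "data tools for language data"]
    print(word_count(corpus))
```

In a real framework such as Hadoop, the map and reduce functions would run in parallel on many machines and the shuffle would move data across the network; the sketch only shows how the three steps fit together logically.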
Learning Goals
The learning objectives of this class are to:
[Systems] Introduce the state of the art in scalable data management and processing. We will focus in particular on the MapReduce framework and its Hadoop implementation.
[ML] Study, practice, and implement statistical models and machine learning algorithms for the purpose of Big Data analytics.
[CL] Introduce computational linguistic techniques and tools for tackling natural language processing problems.
[Hands-on] Provide students with hands-on experience in analyzing large volumes of unstructured data.
Covered Topics
- Basics in NLP & available tools for NLP
- The MapReduce programming framework
- The Hadoop Implementation
- Scalable statistical analysis & machine learning techniques
- Storage/Cleaning/SW Practices
Audience
The course is addressed to upper-level undergraduate students as well as graduate students who have a solid background in programming and computer systems organization. Students are required to have taken COSI 12b (Advanced Programming Techniques) or its equivalent.
Prerequisites
COSI 12b or the equivalent
Required Reading
There is no required textbook for the course. The course will rely mostly on published papers and online resources. The instructors will also make available lecture notes/slides on the topics covered in class. Examples of published articles to be covered include the following:
- MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean, Sanjay Ghemawat, OSDI, 2004
Links
Grading
- 70% Programming Projects (10% for A0 and 15% each for A1-A4)
- 30% Quizzes (3 x 10%)
Instructors
- James Pustejovsky (main instructor) [jamesp (at) cs (dot) brandeis (dot) edu]
- Olga Papaemmanouil [olga (at) cs (dot) brandeis (dot) edu]
- Pengyu Hong [hongpeng (at) brandeis (dot) edu]
- Liuba Shrira [liuba (at) brandeis (dot) edu]
- Marc Verhagen [marc (at) cs (dot) brandeis (dot) edu]
TAs
- Tuan Do [tuandn (at) brandeis (dot) edu]
- Will Burstein [wburstein (at) brandeis (dot) edu]
- Jessica Lowell [jessiehl (at) brandeis (dot) edu]
- Long Sha [longsha (at) brandeis (dot) edu]
Office hours
Name | Time | Location |
---|---|---|
James Pustejovsky | Tu 2-3 | Volen |
Marc Verhagen | Fr 2-3 | Feldberg 106-129 |
Olga Papaemmanouil | Tu, Fr 11-12:30 | Volen |
Pengyu Hong | TBA | Volen |
Liuba Shrira | TBA | Volen |
Tuan Do | Tu 5-7 | Volen 111 |
Will Burstein | Mo, We 6-7 | Vertica |
Jessica Lowell | We 1-2, Fr 3-4 | Volen 111 |
Long Sha | Tu, We 10-11 | Vertica |