University of Warsaw - Central Authentication System

Big data processing and cluster computing

General data

Course ID: 1000-218bPDD
Erasmus code / ISCED: 11.3 / (0612) Database and network design and administration
  • The Erasmus course classification code consists of three to five digits: the first three denote the field according to the list of field codes used in the Socrates/Erasmus programme, the fourth (so far usually 0) optionally refines the discipline, and the fifth gives the level of advancement of the course, determined by the year of study for which the course is intended.
  • The ISCED (International Standard Classification of Education) code was designed by UNESCO.
Course title: Big data processing and cluster computing
Name in Polish: Przetwarzanie dużych danych i programowanie na klastrach
Organizational unit: Faculty of Mathematics, Informatics, and Mechanics
Course groups: Elective courses for second-cycle studies in Bioinformatics
Elective courses for Computer Science
Elective courses: concurrent and distributed programming
ECTS credit allocation (and other scores): 6.00
Basic information on ECTS credit allocation principles:
  • the annual hourly workload of the student’s work required to achieve the expected learning outcomes for a given stage is 1500-1800 h, corresponding to 60 ECTS;
  • the student’s weekly hourly workload is 45 h;
  • 1 ECTS point corresponds to 25-30 hours of student work needed to achieve the assumed learning outcomes;
  • one week of student work towards the assumed learning outcomes earns 1.5 ECTS;
  • the work required to pass a course that has been assigned 3 ECTS constitutes 10% of the student's semester workload.

Language: English
Type of course:

obligatory courses

Short description:

We will present techniques and tools for processing big data sets on clusters of commodity computers. The main technologies covered are Hadoop and Spark. We will start by introducing the architecture of these systems and the programming models they assume, such as MapReduce and Resilient Distributed Datasets (RDDs). Then we will cover the most important algorithmic techniques, along with methods for analysing and comparing algorithms. Finally, we will discuss typical problems such as skew and typical bottlenecks such as limited reducer memory, as well as methods for dealing with them. The course combines theory and practice.
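
The MapReduce programming model mentioned in the description can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (map_reduce, word_count_mapper, sum_reducer) are hypothetical, and the shuffle step that a real cluster performs over the network is simulated here with an in-memory dictionary.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a toy MapReduce job in one process (illustrative only)."""
    groups = defaultdict(list)
    # Map phase: every input record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle (simulated): collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: each key is processed independently of the others.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key, values):
    """Add up all counts emitted for one word."""
    return sum(values)


lines = ["big data on clusters", "Big clusters"]
counts = map_reduce(lines, word_count_mapper, sum_reducer)
print(counts)
```

Because each key is reduced independently, the reduce phase can in principle be spread over many machines. The same property is the source of skew: a key with far more values than the others makes one reducer do most of the work.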

Full description:

1. Hadoop Distributed File System (HDFS)

2. MapReduce model

3. Basic algorithmic techniques for the MapReduce model and methods for analysing algorithms, presented on typical examples (multiplying a matrix by a vector; multiway joins; sorting, ranking, perfect splitting; counting triangles in large graphs)

- computation vs communication cost

- total vs elapsed communication cost

- methods for limiting reducer memory

- methods for combating skew

4. Methods for effective and portable data serialization (e.g. Avro)

5. Cloud platforms: Amazon, Google, Microsoft, IBM

6. Distributed processing of large graphs (BSP and Pregel models)

7. Examples of the most important large-graph processing problems, e.g., PageRank and community detection

8. Spark and Resilient Distributed Datasets (RDDs)

9. Columnar data formats (e.g. Parquet)

10. Spark SQL and the Catalyst optimizer
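
As an illustration of topic 3, the sketch below expresses matrix-by-vector multiplication as a map step, a grouping step, and a reduce step. It is a single-process approximation written for this outline, not course material: the function name matrix_vector_multiply and the (row, column, value) input layout are assumptions, and the vector is assumed to be small enough to be available to every mapper.

```python
from collections import defaultdict


def matrix_vector_multiply(entries, vector):
    """Multiply a sparse matrix by a vector in MapReduce style.

    entries: iterable of (row, col, value) triples for the nonzero entries.
    vector:  list of numbers, assumed to fit in every mapper's memory.
    """
    grouped = defaultdict(list)
    for i, j, m_ij in entries:
        # Map: entry m_ij contributes the product m_ij * v_j to output row i.
        # Shuffle (simulated): products are grouped by the row index i.
        grouped[i].append(m_ij * vector[j])
    # Reduce: sum the products of each row to get one component of the result.
    return {i: sum(products) for i, products in grouped.items()}


# 2 x 2 example: [[1, 2], [3, 4]] times [5, 6]
matrix = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
vector = [5.0, 6.0]
result = matrix_vector_multiply(matrix, vector)
print(result)  # row 0: 1*5 + 2*6 = 17, row 1: 3*5 + 4*6 = 39
```

In this formulation every nonzero matrix entry produces one key-value pair, so the communication cost is proportional to the number of nonzero entries, while the reducer for row i needs memory proportional to the number of nonzero entries in that row.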

Bibliography:

- Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer. Morgan & Claypool Publishers, 2010

- Mining of Massive Datasets, Anand Rajaraman (WalmartLabs) and Jeffrey David Ullman (Stanford University)

- Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale, 4th Edition, Tom White. O'Reilly Media, 2015

Learning outcomes:

Knowledge:

1. Understands the MapReduce model and knows how to use it to solve basic problems such as relational algebra operations or multiplying a matrix by a vector (K_W01)

2. Knows the complexity of distributed algorithms and of algorithms for big data processing (K_W01, K_W02)

3. Knows basic algorithmic techniques for big data processing, such as minimal algorithms (K_W01)

4. Knows the main available cloud infrastructures (K_W06)

5. Knows techniques for data serialization (K_W01)
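
One of the relational algebra operations referred to in outcome 1 is the natural join. A possible MapReduce-style formulation, often called a reduce-side join, is sketched below in a single process. The function name reduce_side_join and the relation schemas R(a, b) and S(b, c) are assumptions made for this example only.

```python
from collections import defaultdict


def reduce_side_join(r, s):
    """Natural join of R(a, b) and S(b, c) on attribute b (illustrative only)."""
    # For each join key b, keep the a-values from R and the c-values from S.
    groups = defaultdict(lambda: ([], []))
    # Map: key every tuple by b and remember which relation it came from.
    for a, b in r:
        groups[b][0].append(a)
    for b, c in s:
        groups[b][1].append(c)
    # Reduce: for each key, pair every R tuple with every S tuple.
    return sorted(
        (a, b, c)
        for b, (a_values, c_values) in groups.items()
        for a in a_values
        for c in c_values
    )


R = [(1, "x"), (2, "y"), (3, "x")]
S = [("x", 10), ("z", 30), ("x", 20)]
joined = reduce_side_join(R, S)
print(joined)
```

Keys that appear in only one relation ("y" and "z" here) produce no output. A key that is very frequent in both relations makes its reducer build a large cross product, which is one way the skew and reducer-memory problems named in the course description arise.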

Skills:

1. Can analyse the complexity of big data algorithms, compare such algorithms, and choose the right algorithm for a given use case (KU_01)

2. Can express solutions to problems in the most important models of big data computation, such as MapReduce (KU_02, KU_04)

3. Can diagnose bottlenecks in big data algorithms (KU_07)

4. Can use frameworks like Hadoop and Spark (KU_08)

5. Can serialize/deserialize data in sequential and columnar formats such as Avro and Parquet (KU_08)

6. Can configure a cluster with Hadoop and Spark (KU_08)

7. Can run processing tasks on cloud infrastructure (KU_08)

8. Can follow tutorials on big data processing topics (KU_15)

Competences:

1. Knows the most important libraries of big data algorithms, such as Spark GraphX, Spark MLlib and Apache Mahout (K_K01)

2. Can diagnose problems and find solutions to them on Internet community portals such as Stack Overflow (K_K02)

Assessment methods and assessment criteria:

The lab is graded based on large programming assignments and points for work during classes. To be admitted to the first-term exam, a student needs at least half of the possible lab points. Large programming assignments submitted after the deadline receive a penalty, or are not graded at all if the delay is too long. The first-term grade is based on the lab and exam points combined. The second-term grade is based on exam points only.

PhD students are additionally required to read and present a current research paper on a topic related to the lecture (the choice of paper must be accepted by the lecturer).

Classes in period "Summer semester 2023/24" (in progress)

Time span: 2024-02-19 - 2024-06-16
Type of class:
Lab, 30 hours
Lecture, 30 hours
Coordinators: Jacek Sroka
Group instructors: Grzegorz Bokota, Jacek Sroka
Examination: Examination

Classes in period "Summer semester 2024/25" (future)

Time span: 2025-02-17 - 2025-06-08
Type of class:
Lab, 30 hours
Lecture, 30 hours
Coordinators: Jacek Sroka
Group instructors: Grzegorz Bokota, Jacek Sroka
Examination: Examination
Course descriptions are protected by copyright.
Copyright by University of Warsaw.
Krakowskie Przedmieście 26/28
00-927 Warszawa
tel: +48 22 55 20 000 https://uw.edu.pl/