Big data processing and cluster computing
General data
Course ID: | 1000-218bPDD |
Erasmus code / ISCED: | 11.3 |
Course title: | Big data processing and cluster computing |
Name in Polish: | Przetwarzanie dużych danych i programowanie na klastrach |
Organizational unit: | Faculty of Mathematics, Informatics, and Mechanics |
Course groups: |
Elective courses for second-cycle studies in Bioinformatics; Elective courses for Computer Science; Elective courses: concurrent and distributed programming |
ECTS credit allocation (and other scores): | 6.00 |
Language: | English |
Type of course: | obligatory courses |
Short description: |
We will present techniques and tools for processing big data sets on clusters of commodity computers. The main technologies covered are Hadoop and Spark. We will start by introducing the architecture of these systems and the programming models they assume, such as MapReduce and the Resilient Distributed Dataset. Then we will cover the most important algorithmic techniques, along with methods for analysing and comparing algorithms. Finally, we will discuss typical problems such as skew and typical bottlenecks such as limited reducer memory, as well as methods for dealing with them. The course combines theory and practice. |
Full description: |
1. Hadoop Distributed File System (HDFS)
2. MapReduce model
3. Basic algorithmic techniques for the MapReduce model and methods for analysing algorithms, presented on typical examples (multiplying a matrix by a vector; multiway joins; sorting, ranking, perfect splitting; triangle counting in large graphs)
   - computation vs communication cost
   - total vs elapsed communication cost
   - methods for limiting reducer memory
   - methods for combating skew
4. Methods for effective and portable data serialization (e.g. Avro)
5. Cloud platforms: Amazon, Google, Microsoft, IBM
6. Distributed processing of large graphs (BSP and Pregel models)
7. Examples of the most important large-graph processing problems, e.g. PageRank and community detection
8. Spark and the Resilient Distributed Dataset
9. Columnar data formats (e.g. Parquet)
10. Spark SQL and the Catalyst optimizer |
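As an illustration of item 3, multiplying a matrix by a vector in the MapReduce model can be sketched in plain Python. This is a minimal in-memory simulation, not Hadoop or Spark code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `matvec`) are hypothetical, the matrix is assumed to be given as sparse `(row, column, value)` triples, and the whole vector is assumed to fit in the memory of every mapper.

```python
from collections import defaultdict


def map_phase(matrix_entries, vector):
    """Map: for each matrix entry (i, j, m_ij) emit the pair (i, m_ij * v_j)."""
    for i, j, m_ij in matrix_entries:
        yield i, m_ij * vector[j]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (here, the row index)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the partial products of each row to get one output entry."""
    return {i: sum(values) for i, values in groups.items()}


def matvec(matrix_entries, vector):
    """Run the three simulated phases one after another."""
    return reduce_phase(shuffle(map_phase(matrix_entries, vector)))


# The 2x2 matrix [[1, 2], [3, 4]] as (row, column, value) triples.
entries = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
result = matvec(entries, [10, 20])
print(result)
```

In this sketch the communication cost is one key-value pair per matrix entry, and each reducer receives as many values as there are entries in its row, which is where reducer memory limits and skew would show up on real data.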
Bibliography: |
- Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010
- Anand Rajaraman (WalmartLabs) and Jeffrey David Ullman (Stanford University, California), Mining of Massive Datasets
- Tom White, Hadoop: The Definitive Guide, 4th Edition, Storage and Analysis at Internet Scale, O'Reilly Media, 2015 |
Learning outcomes: |
Knowledge:
1. Understands the MapReduce model and knows how to use it to solve basic problems such as relational algebra operations or multiplying a matrix by a vector (K_W01)
2. Has knowledge of the complexity of distributed algorithms and algorithms for big data processing (K_W01, K_W02)
3. Has knowledge of basic algorithmic techniques for big data processing, such as minimal algorithms (K_W01)
4. Has knowledge of the main available cloud infrastructures (K_W06)
5. Has knowledge of techniques for data serialization (K_W01)

Skills:
1. Can analyse the complexity of big data algorithms, compare such algorithms, and choose the right algorithm for a given use case (KU_01)
2. Can express solutions to problems in the most important models of big data computation, such as MapReduce (KU_02, KU04)
3. Can diagnose bottlenecks in big data algorithms (KU_07)
4. Can use frameworks such as Hadoop and Spark (KU_08)
5. Can serialize and deserialize data in sequential and columnar formats such as Avro and Parquet (KU_08)
6. Can configure a cluster with Hadoop and Spark (KU_08)
7. Can run processing tasks on cloud infrastructure (KU_08)
8. Can follow tutorials on big data processing topics (KU_15)

Competences:
1. Knows the most important libraries of big data algorithms, such as Spark GraphX, Spark MLlib and Apache Mahout (K_K01)
2. Can diagnose problems and find solutions to them on Internet community portals such as Stack Overflow (K_K02) |
Assessment methods and assessment criteria: |
The lab is graded on big programming assignments and on points for work during classes. To be admitted to the first-term exam, a student needs at least half of the possible lab points. Big programming assignments submitted after the deadline are penalized, or not graded at all if the delay is too long. The first-term grade is based on the lab and exam points in total. The second-term grade is based on exam points only. PhD students are additionally required to read and present a current research paper on a topic related to the lecture (the choice of paper needs to be accepted by the lecturer). |
Classes in period "Summer semester 2023/24" (in progress)
Time span: | 2024-02-19 - 2024-06-16 |
Timetable: | Wednesday: lecture (WYK), lab; Friday: lab |
Type of class: | Lab, 30 hours; Lecture, 30 hours |
Coordinators: | Jacek Sroka |
Group instructors: | Grzegorz Bokota, Jacek Sroka |
Examination: | Examination |
Classes in period "Summer semester 2024/25" (future)
Time span: | 2025-02-17 - 2025-06-08 |
Type of class: | Lab, 30 hours; Lecture, 30 hours |
Coordinators: | Jacek Sroka |
Group instructors: | Grzegorz Bokota, Jacek Sroka |
Examination: | Examination |
Copyright by University of Warsaw.