
Data engineering

General data

Course ID: 1000-2M23DE
Erasmus code / ISCED: 11.3 / (0612) Database and network design and administration
Course title: Data engineering
Name in Polish: Data engineering
Organizational unit: Faculty of Mathematics, Informatics, and Mechanics
Course groups: 4EU+ courses (offered by teaching units)
Elective courses for Computer Science
Elective courses for Machine Learning
ECTS credit allocation (and other scores): 6.00
Language: English
Type of course:

elective monographs

Short description:

Overview of the data processing pipeline; collection and storage of raw data; processing, cleaning, and storage of processed data; scaling tools for the data processing system.
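
The stages named above (collection, raw storage, cleaning, processed storage) can be illustrated with a minimal sketch. This is not course material: the `collect`, `clean`, and `run_pipeline` functions and the sample records are hypothetical, and in-memory lists stand in for real storage.

```python
import json


def collect():
    """Hypothetical source system: returns raw records as JSON strings."""
    return [
        '{"user": " Alice ", "amount": "10.5"}',
        '{"user": "bob", "amount": "n/a"}',
        '{"user": "Carol", "amount": "3"}',
    ]


def clean(raw_lines):
    """Parse raw lines, normalize user names, and drop unparsable amounts."""
    cleaned = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            amount = float(record["amount"])
        except ValueError:
            # Skip records whose amount is not a number.
            continue
        cleaned.append({"user": record["user"].strip().lower(), "amount": amount})
    return cleaned


def run_pipeline():
    """Keep the raw data unchanged and store the cleaned data separately."""
    raw_store = collect()
    processed_store = clean(raw_store)
    return raw_store, processed_store
```

Keeping the raw store untouched means the cleaning step can be changed and rerun later without collecting the data again.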

Full description:

The course starts from the basics of the data engineering task and shows how a data engineering system differs from something like a personal blog or an e-commerce platform. It briefly defines the areas where data engineering approaches make sense. It then gives an overview of file formats and why they matter, and decomposes the general idea of a database into its parts. It describes ways to store data in the system, demonstrates how to implement processing tasks in the context of a data pipeline, and presents tooling for orchestrating independent tasks into a single pipeline. In addition, it describes tools such as queues that connect the elements of a data engineering system with each other and with elements outside it.
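
As a rough illustration of orchestrating independent tasks into a single pipeline, the sketch below runs tasks in dependency order using the standard-library `graphlib` module (Python 3.9+). It is an assumption-laden toy, not how Prefect or any other orchestrator from the course works: the `run_dag` helper and the three task names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run each task after all of its prerequisites.

    tasks: task name -> callable that receives the dict of earlier results.
    dependencies: task name -> set of task names that must run first.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name](results)
    return order, results


# Hypothetical three-step pipeline: extract -> transform -> load.
tasks = {
    "extract": lambda results: [1, 2, 3],
    "transform": lambda results: [x * 2 for x in results["extract"]],
    "load": lambda results: sum(results["transform"]),
}
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order, results = run_dag(tasks, dependencies)
```

Declaring dependencies as data, rather than hard-coding call order, is what lets an orchestrator schedule, retry, or parallelize tasks independently.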

1. Introduction, MAD, MDS, Data Engineering life cycle, sources of information and self-education

2. Evolution of Data Engineering, Lambda architecture, Kappa architecture, cloud native, storage and compute separation

3. Source system

4. Data modelling, transformation, DAG, Spark

5. Data warehouse, data lake, lakehouse

6. Data governance, Data Hub

7. Streams vs queues, Spark, Pulsar

8. Decomposition, orchestration, Prefect

9. Consumers, Superset

10. Quality, security, observability

11. Data Engineering architecture and the people we work with

12. Project demo

13. Summary
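
Topic 7 in the outline contrasts streams with queues. The sketch below shows one common way to frame the difference, under simplifying assumptions: a queue hands each message to exactly one consumer and then forgets it, while a stream (log) keeps every message and lets each consumer track its own read position. The `WorkQueue` and `EventLog` classes are hypothetical and do not reflect the API of Pulsar, Kafka, or Spark.

```python
from collections import deque


class WorkQueue:
    """Queue semantics: a consumed message is removed for everyone."""

    def __init__(self):
        self._items = deque()

    def publish(self, message):
        self._items.append(message)

    def consume(self):
        # Returns None when the queue is empty.
        return self._items.popleft() if self._items else None


class EventLog:
    """Stream semantics: messages are retained; each consumer has an offset."""

    def __init__(self):
        self._entries = []
        self._offsets = {}

    def publish(self, message):
        self._entries.append(message)

    def consume(self, consumer):
        offset = self._offsets.get(consumer, 0)
        if offset >= len(self._entries):
            return None  # This consumer has read everything so far.
        self._offsets[consumer] = offset + 1
        return self._entries[offset]
```

With the log, a new consumer can start from the first entry and replay the full history, which is not possible once a queue has delivered its messages.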

Bibliography:

1. Designing Data-Intensive Applications. A must-read book (and worth rereading).

2. Data Engineering with Apache Spark, Delta Lake, and Lakehouse: Create scalable pipelines that ingest, curate, and aggregate complex data in a timely and secure way. A book for gaining structured knowledge of the Databricks family of tools.

3. Kafka in Action. For those who don't want to read the docs. It will be out of date in 1-2 years, but for now it is good for building intuition.

4. The Log: What every software engineer should know about real-time data's unifying abstraction. A must-read article (yes, it is fine that it is from 2013), and the blog is good to read in general: https://engineering.linkedin.com/blog/topic/distributed-systems.

5. How to beat the CAP theorem.

6. Questioning the Lambda Architecture.

7. How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh.

8. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics.

9. Towards Data Science – as a source of some news, good for beginners.

10. https://medium.com/the-prefect-blog has many articles that are good for beginners (e.g. https://medium.com/the-prefect-blog/are-you-an-accidental-data-engineer-6b60e0f51286; you can skip everything related directly to Prefect).

Learning outcomes:

Understanding the basic principles of most data processing tasks and the mechanics of modern tools.

Assessment methods and assessment criteria:

- Lab projects.

- If an LMS is used: topic assessments, including peer-review assessments.

- Final project.

Classes in period "Winter semester 2023/24" (past)

Time span: 2023-10-01 - 2024-01-28
Type of class:
Lab, 30 hours
Lecture, 30 hours
Coordinators: Yura Braiko
Group instructors: Yura Braiko
Examination: Examination
Notes:

Classes are conducted in English and remotely.

Course descriptions are protected by copyright.
Copyright by University of Warsaw.