
Big data mining and processing

General data

Course ID: 1000-2M13DZD
Erasmus code / ISCED: 11.304 / (0612) Database and network design and administration
(The course classification code consists of three to five digits: the first three give the field classification according to the list of subject-area codes used in the Socrates/Erasmus programme, the fourth (so far usually 0) optionally refines the discipline, and the fifth gives the level of advancement, based on the year of study for which the course is intended. The ISCED (International Standard Classification of Education) code was designed by UNESCO.)
Course title: Big data mining and processing
Name in Polish: Eksploracja i przetwarzanie dużych zbiorów danych
Organizational unit: Faculty of Mathematics, Informatics, and Mechanics
Course groups: Elective courses for second-cycle studies in Bioinformatics
Elective courses for Computer Science
Elective courses for Machine Learning
ECTS credit allocation (and other scores): 6.00
Basic information on ECTS credit allocation principles:
  • the annual workload required of a student to achieve the expected learning outcomes for a given stage is 1500-1800 h, corresponding to 60 ECTS;
  • the student’s weekly workload is 45 h;
  • 1 ECTS point corresponds to 25-30 hours of student work needed to achieve the assumed learning outcomes;
  • the weekly student workload needed to achieve the assumed learning outcomes earns 1.5 ECTS;
  • the work required to pass a course that has been assigned 3 ECTS constitutes 10% of the semester student workload.

Language: English
Main fields of studies for MISMaP:

computer science
mathematics

Type of course:

elective monographs
optional courses

Prerequisites:

Big data processing and cluster computing 1000-218bPDD
Data mining 1000-2M03DM

Prerequisites (description):

Theoretical and practical foundations of machine learning, data mining, statistical data analysis, data processing, and databases make it considerably easier to acquire knowledge during this course. The course also extends basic courses in artificial intelligence and big data processing; nevertheless, students can successfully improve their knowledge in these fields during the course itself.

Mode:

Blended learning
Classroom
Remote learning

Short description:

The course consolidates theoretical and practical knowledge of machine learning and data mining methods in applications involving large, heterogeneous, distributed, and dynamically growing data. We discuss problems of data reliability and quality that arise when training effective models for classification, prediction, and related applications, as well as when maintaining the effectiveness of such models deployed as components of larger IT systems. We refer to a wide range of practical data sources and forms, in particular machine-generated data. We cover a wide range of practical tasks in machine learning and data analysis, e.g. anomaly detection or similarity recognition. Based on practical examples, we discuss the full life cycle of data and information in processing and analysis systems, including properly integrated solutions based on machine learning and data analysis.

Full description:

The course topics can be divided into the following sections:

1. Overview of selected methods used in machine learning and data analysis (e.g. rule induction, feature selection, and cluster analysis on the one hand, and XGBoost, SVM, and various neural network architectures on the other; we assume that participants are already partly familiar with such methods) from the perspective of problems concerning partially distributed and big data.

2. Discussion of selected methods in machine learning and data analysis, with their results understood as solutions to optimization problems formulated on the input data. In particular, we consider the complexity of these problems for big data, where we tend to favour heuristic or randomized algorithms, or solutions drawn from artificial intelligence (evolutionary algorithms, simulated annealing, etc.). We also hold a more general discussion of the connections between machine learning (ML) and artificial intelligence (AI), noting that AI and ML are not identical but that each domain is crucial to the other.

3. Discussion of typical IT system scenarios, distinguished by the type of data flow, and the related challenges for machine learning and data analysis, e.g. variants of performing local operations on data before it is registered in a fully scalable infrastructure.

4. Integration of the functionalities and needs of machine learning and data analysis methods with other systems, such as databases (both SQL and NoSQL) or Business Intelligence systems. There are two categories: a) ML functions called at the interface level (e.g. as SQL functions); b) ML algorithms that themselves use such interfaces (e.g. scripts that compute ML models based on automatically generated analytical SQL queries).

5. Discussion of various approaches to compactifying data and models in order to improve machine learning and data analysis on big data (including streaming and high-dimensional data). Compactification may include: a) data compression and quantization; b) implementations of machine learning and knowledge discovery models based on approximate computing (e.g. by sampling data); c) dimensionality reduction, feature selection and extraction, and simplifying models by manually reducing parameters. We also cover hybrid scenarios, e.g. fine-tuning hyperparameters under stricter compactification (which is faster) and then learning the final model more carefully.

6. Discussion of problems related to cleaning large, multi-modal, heterogeneous, multidimensional data for the purposes of machine learning and data analysis. This includes scenarios in which data, although big, are not suitable for learning models (e.g. they lack labels for the concepts, situations, or objects of interest) and where appropriate processes must be launched to make learning possible, e.g. interactive search for representative examples in data repositories; these processes may differ depending on how much expert knowledge the labelling requires. It also includes scenarios where the training data contain errors (junk data), e.g. in measurements or labels, which can make the learned models less accurate.

7. Challenges related to maintaining the effectiveness of models obtained with machine learning and data analysis, seen as components of a larger IT system. This covers the processes of fine-tuning and retraining models on new data, which can be designed in different ways depending on the size of the data, the dynamics of data growth, and the speed at which models must be used and adjusted in various business application scenarios. It also covers diagnosing models based on the errors they make, using, among others, explanation and visualization techniques.

8. Practical application scenarios for data discovery processes, including setting analytical goals, data preprocessing, and applying machine learning and data analysis methods. Examples of such implementations come from big data competitions organized online (e.g. Knowledge Pit) and include: cooperating with competition sponsors; providing feasible data for the competition (data anonymization, maintaining data quality, connection with the problem solved during the competition); and implementing, as prototypes, solutions that proved successful in the competition but may still require work.
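
As an illustration of the hybrid scenario mentioned in topic 5 (choosing hyperparameters on strongly compactified data, then learning the final model on all of it), the following is a minimal sketch only, not material from the course. It assumes a toy 1-D k-nearest-neighbours classifier and synthetic data; all function names, sample sizes, and candidate values of k are hypothetical choices made for this example.

```python
import random


def make_data(n, seed=0):
    """Synthetic 1-D two-class data (an assumption for this sketch):
    class 1 is centred at 2.0, class 0 at 0.0, both with unit variance."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 * label, 1.0)
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(train, test, k):
    """Fraction of test points classified correctly by k-NN over train."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def tune_on_sample(data, candidates, sample_size, seed=0):
    """Cheap step: pick k using only a small random sample of the data,
    split in half into a fitting part and a validation part."""
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)
    half = sample_size // 2
    fit, val = sample[:half], sample[half:]
    return max(candidates, key=lambda k: accuracy(fit, val, k))


data = make_data(2000)
train, test = data[:1500], data[1500:]

# Stricter compactification (200 of 1500 points) for the hyperparameter search.
best_k = tune_on_sample(train, candidates=[1, 5, 15], sample_size=200)

# More careful step: the final model uses the full training set.
final_acc = accuracy(train, test, best_k)
print(best_k, round(final_acc, 3))
```

The trade-off shown is that the search over candidate values of k touches only the sample, while the expensive pass over all training data happens once, for the selected value. Whether a hyperparameter chosen on a sample transfers well to the full data depends on the model and the data, so this should be read as one possible arrangement rather than a general recipe.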

Bibliography:

The course will partially be based on the course "Mining of Massive Datasets" (mmds.org). You can find useful examples, presentations and videos on the course website. The following literature is related:

1. Anand Rajaraman and Jeff Ullman: "Mining of Massive Datasets"

2. Jiawei Han and Micheline Kamber: "Data Mining: Concepts and Techniques"

3. Gregory Piatetsky-Shapiro: "KDnuggets"

4. IEEE Big Data Conferences

The latest materials from these conferences will be provided. They include, but are not limited to, articles describing machine learning competitions (e.g. organized on the Knowledge Pit platform, which can also be used as an independent source of information and data), and may also be useful for the preparation of projects by PhD candidates attending classes.

Learning outcomes:

Knowledge and skills:

-- In line with the 8 main topics listed in the full description.

Social competences:

-- Can prepare and present a report on the analysis of practical big data, where the analysis is carried out using the methods of data mining and machine learning discussed in class.

-- Can point out, in non-specialized language aimed at potential users of analytical systems who are not necessarily experts in machine learning, data mining, or so-called data science, which big data problems (e.g. size, dimensionality, multimodality, quality, and variability of data) may arise during the processing and mining of a specific practical dataset.

Assessment methods and assessment criteria:

As for the exercises, during the semester participants will carry out a project related to the subject of the course. The project may take the form of participation in a competition related to the analysis of big datasets (e.g. on Knowledge Pit). Projects can be carried out individually or in pairs. Each project should end with a presentation. Presentations will be given in the last week of the semester (PhD candidates may present earlier). The presentations will be the basis for passing the exercises.

As for the lecture, the basis for passing it is the preparation of a presentation based on an article of the student's choice published in the IEEE Big Data conference series, or on an article published elsewhere if it is related to the student's interests and the lecture topics and is accepted by the lecturer. Presentations will be delivered in the last month of the course.

To receive the final grade in the first term, both the exercises and the lecture must be passed. The final grade in the second term (September) will be determined by an oral exam comprising the presentation of an article (see the criteria for passing the lecture) and the presentation of a finalized project (see the criteria for passing the exercises).

Classes in period "Winter semester 2023/24" (past)

Time span: 2023-10-01 - 2024-01-28
Type of class:
Classes, 30 hours
Lecture, 30 hours
Coordinators: Dominik Ślęzak
Group instructors: Andrzej Janusz, Dominik Ślęzak
Examination: Examination

Classes in period "Winter semester 2024/25" (future)

Time span: 2024-10-01 - 2025-01-26
Type of class:
Classes, 30 hours
Lecture, 30 hours
Coordinators: Dominik Ślęzak
Group instructors: Sebastian Stawicki, Dominik Ślęzak
Examination: Examination
Course descriptions are protected by copyright.
Copyright by University of Warsaw.