From corpus to model. Methods of text collection, processing, and annotation

General data

Course ID: 1500-SZD-OKDM
Erasmus code / ISCED: (unknown) / (unknown)
Course title: From corpus to model. Methods of text collection, processing, and annotation
Name in Polish: Od korpusu do modelu. Metody zbierania, przetwarzania i anotacji tekstów
Organizational unit: Faculty of Polish Studies
Course groups:
ECTS credit allocation (and other scores): (not available)
Basic information on ECTS credit allocation principles:
  • the annual student workload required to achieve the expected learning outcomes for a given stage is 1500-1800 h, corresponding to 60 ECTS;
  • the student’s weekly workload is 45 h;
  • 1 ECTS point corresponds to 25-30 hours of student work needed to achieve the assumed learning outcomes;
  • one week of student work towards the assumed learning outcomes earns 1.5 ECTS;
  • the work required to pass a course that has been assigned 3 ECTS constitutes 10% of the student’s semester workload.
Language: Polish
Type of course:

elective courses

Short description:

The main goal of the course is to give participants a practical introduction to computational linguistics and natural language processing, in particular the creation and processing of text corpora with custom annotation layers.

Participants will work through the full cycle of text data preparation: collection, programmatic processing, manual annotation, and finally the training of a machine learning model that enables automated annotation of further text resources.

Full description:

The main goal of the course is to provide participants with a practical introduction to computational linguistics and natural language processing, so that they acquire the skills needed to create and process text corpora with custom annotation layers. Corpus annotation consists in enriching a set of raw text data with additional linguistic information, e.g., labelling particular text segments with their part of speech. Existing corpus tools allow automatic annotation of text data in terms of inflection and syntax, and sometimes also in terms of semantics and pragmatics. Because commonly available tools are general-purpose, they have natural limitations; the ability to create their own annotation layers allows participants to overcome these limitations and prepare a tailor-made solution for their research projects.
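The idea of annotation layers described above can be illustrated with a minimal sketch. It assumes a simple stand-off representation in which each annotation is a character range plus a label; the `Span` class, the `spans_for_tokens` helper, and the `OWNER`/`PET` labels are hypothetical names chosen for this illustration, not part of any corpus tool named in the course description. The part-of-speech tags are assigned by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One annotation: a character range in the raw text plus a label."""
    start: int
    end: int
    label: str


def spans_for_tokens(text, tagged_tokens):
    """Locate each (token, label) pair in `text`, scanning left to right."""
    spans = []
    cursor = 0
    for token, label in tagged_tokens:
        start = text.index(token, cursor)  # raises ValueError if the token is absent
        end = start + len(token)
        spans.append(Span(start, end, label))
        cursor = end
    return spans


def covered_text(text, span):
    """Return the piece of raw text that a span points at."""
    return text[span.start:span.end]


raw_text = "Ala ma kota."

# Layer 1: part-of-speech tags, assigned by hand for this illustration.
pos_layer = spans_for_tokens(
    raw_text,
    [("Ala", "PROPN"), ("ma", "VERB"), ("kota", "NOUN"), (".", "PUNCT")],
)

# Layer 2: a custom, project-specific layer (hypothetical labels).
custom_layer = [Span(0, 3, "OWNER"), Span(7, 11, "PET")]

# The raw text stays untouched; every layer only refers to it by offsets.
corpus_entry = {"text": raw_text, "layers": {"pos": pos_layer, "custom": custom_layer}}

for span in corpus_entry["layers"]["custom"]:
    print(span.label, covered_text(raw_text, span))
```

Keeping each layer separate from the raw text means a new custom layer can be added later without disturbing the existing ones.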

In the first block of classes, participants will learn the basics of programming in Python and its packages for processing natural language and text data (e.g., spaCy), and will become familiar with the datasets used to create tools for building, automatically analysing, and annotating text corpora (such as Korpusomat or Sketch Engine).
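As a rough illustration of programmatic text processing in Python, the sketch below builds a token frequency list for a tiny collection of documents. It uses only the standard library: the regular-expression tokenizer is a deliberately naive stand-in and is not how spaCy tokenizes text. The function names and the sample sentences are assumptions made for this example.

```python
import re
from collections import Counter

# Naive tokenizer: runs of letters, digits, or underscores (Unicode-aware in Python 3).
TOKEN_PATTERN = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


def frequency_list(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for document in documents:
        counts.update(tokenize(document))
    return counts


documents = ["Ala ma kota.", "Kot ma Alę.", "Ala lubi kota."]
freq = frequency_list(documents)

for token, count in sorted(freq.items()):
    print(token, count)
```

Note that the inflected forms "kot" and "kota", or "ala" and "alę", are counted as different tokens here. Grouping them under one lemma is the kind of step that dedicated NLP packages are meant to handle.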

The second thematic block will be devoted to the methodology and practice of text annotation, i.e., to the creation of custom data resources, and to machine learning concepts (e.g., machine learning and deep learning, supervised and unsupervised learning, different variants of text classification tasks). The resources prepared by the participants in Label Studio will be used to train their own machine learning models, based on state-of-the-art neural network architectures that allow for relatively high performance on small training sets.
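The step from annotated data to a trained model can be sketched as follows. This is not the neural approach the course describes: as a simplification that runs with the standard library alone, it swaps in a multinomial Naive Bayes classifier with add-one smoothing. The class name, the `positive`/`negative` labels, and the four hand-labelled examples are all hypothetical and stand in for data exported from an annotation tool.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Naive lowercase word tokenizer (illustrative only)."""
    return re.findall(r"\w+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()             # examples seen per label
        self.token_counts = defaultdict(Counter)  # token counts per label
        self.vocabulary = set()

    def fit(self, examples):
        """Learn counts from (text, label) pairs."""
        for text, label in examples:
            self.label_counts[label] += 1
            for token in tokenize(text):
                self.token_counts[label][token] += 1
                self.vocabulary.add(token)
        return self

    def predict(self, text):
        """Return the label with the highest log-probability for `text`."""
        total_examples = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        # Tokens never seen in training carry no information, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            score = math.log(self.label_counts[label] / total_examples)
            label_total = sum(self.token_counts[label].values())
            for token in tokens:
                count = self.token_counts[label][token]
                score += math.log((count + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical manually annotated examples: (text, label).
annotated = [
    ("great book, really enjoyed it", "positive"),
    ("wonderful and moving story", "positive"),
    ("boring plot and weak characters", "negative"),
    ("terrible, a waste of time", "negative"),
]

model = NaiveBayesClassifier().fit(annotated)
print(model.predict("a wonderful book"))
print(model.predict("boring and terrible"))
```

The same fit-then-predict pattern applies when the classifier is replaced by a neural model; what changes is how the text is represented and how the parameters are learned.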

Text corpora are a natural basis for quantitative and qualitative research in linguistics, literary studies, and sociology. However, the use of corpora is not limited to these fields: they can also be useful to researchers in other disciplines, e.g., for building a corpus of articles from a given research discipline to support a literature review.

The methods and examples presented during the classes will mainly concern the Polish language. In general, however, participants will be able to find equivalents or transfer them to other languages (e.g., English, German, French, or Italian) without much difficulty. Participants are not required to have prior knowledge of programming languages or programming skills.

Bibliography:

Lane, H., Howard, C., Hapke, H. M. (2021) Przetwarzanie języka naturalnego w akcji. Rozumienie, analiza i generowanie tekstu w Pythonie na przykładzie języka angielskiego. Warszawa: PWN.

Pustejovsky, J., Stubbs, A. (2013) Natural Language Annotation for Machine Learning. Sebastopol, CA: O’Reilly Media.

Sweigart, A. (2020) Automatyzacja nudnych zadań z Pythonem. Nauka programowania. Gliwice: Helion.

Altinok, D. (2021) Mastering spaCy: An end-to-end practical guide to implementing NLP applications using the Python ecosystem. Birmingham: Packt Publishing.

Kinsley, H., Kukieła, D. (2020) Neural Networks from Scratch in Python. Sentdex, Kinsley Enterprises.

Learning outcomes:

Knowledge:

Student

P8S_WG has advanced knowledge of selected tools for text data processing and analysis

P8S_WG knows the basics of programming in Python and packages of this language for processing and analysing text data

P8S_WG knows the most important text resources and their use in existing IT corpus tools

P8S_WG knows the most important concepts and techniques of natural language processing and machine learning

Skills:

Student

P8S_UW is able to acquire and process text data for qualitative and quantitative analyses using programming techniques

P8S_UW is able to annotate a set of texts, taking into account methodological requirements and consequences for the efficiency of trained machine learning models

P8S_UK is able to communicate on specialist topics in natural language processing to a degree that enables active participation in the international scientific community

Social competences:

Student

P8S_KK is willing to acknowledge the importance of knowledge of natural language processing in solving both cognitive and practical problems, and to apply natural language processing methods to achieve their own research goals

Assessment methods and assessment criteria:

Class attendance (two absences permitted)

Completion of a small individual or group project involving the collection and annotation of data (creation of a microcorpus) and training a machine learning model

This course is not currently offered.
Course descriptions are protected by copyright.
Copyright by University of Warsaw.