University of Warsaw

Introduction to natural language processing

General data

Course ID: 3003-C3N-JK1
Erasmus code / ISCED: (unknown) / (0232) Literature and linguistics
Course title: Introduction to natural language processing
Name in Polish: Wprowadzenie do przetwarzania języka naturalnego
Organizational unit: Institute of Polish Language
Course groups: Elective seminars for Polish philology - full-time second-cycle studies 2023/2024
Elective seminars for Polish philology (FP) - full-time second-cycle studies 2023/2024 - "Modernity" module
"Modernity" module - Polish philology from the 2019 cycle - full-time second-cycle studies
All Polish studies courses - offered by ILP (3001...), IJP (3003...) and IPS (3007...)
ECTS credit allocation (and other scores): 7.00
Language: Polish
Type of course:

obligatory courses

Prerequisites (description):

The aim of the class is to give participants a practical introduction to the concepts and methods of natural language processing. Natural language processing is an applied, interdisciplinary field that draws on linguistics, programming and machine learning, and it has recently gained importance and recognition thanks to tools such as ChatGPT.

Natural language processing enables the automated analysis of collections of texts and the creation of artificial intelligence systems based on text data (search engines, chatbots, corpus tools, etc.). The class includes a crash course in Python programming and hands-on work with packages that enable automated text analysis, including spaCy, StyloMetrix, BERTopic and others, e.g. packages for creating statistical summaries and visualising the results obtained.

Mode:

Classroom

Short description:

The aim of the course is to give participants a practical introduction to natural language processing, computational linguistics and programming, in particular to processing text corpora with the natural language processing techniques available in the Python programming language.

Participants are not required to have prior knowledge of programming languages or any programming skills.

Full description:

The aim of the course is to give participants a practical introduction to natural language processing, computational linguistics and programming, in particular to processing text corpora with the natural language processing techniques available in the Python programming language.

Participants are not required to have prior knowledge of programming languages or any programming skills, but they are expected to bring the motivation and commitment needed to acquire programming skills in natural language processing.

Topics defining the scope of the course:

1. Basics of programming in Python: variable types, data structures, conditions and loops, functions and classes, working with files and using packages.

2. Application of Python for text data collection and processing (scraping, API querying, OCR and audio transcription).

3. spaCy and different levels of linguistic annotation: morpho-syntactic analysis and tagging, dependency parsing.

4. Vector semantics and language models.

5. Models for sequence classification and token classification in spaCy.

6. Searching text with spaCy: rule-based and layer-based annotation search, semantic search.

7. Stylometric analysis of texts using StyloMetrix, pandas and scikit-learn.

8. Topic modelling using BERTopic.

9. Visualisation of corpus processing results.
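
As a rough illustration of how topics 1 and 9 fit together (basic Python data structures, loops and functions used to produce a simple statistical summary of a corpus), a minimal sketch using only the Python standard library might look as follows. The two-sentence corpus and the function names `tokenize` and `word_frequencies` are hypothetical and invented for this example; the regex tokenizer is a deliberate simplification and does not stand in for the linguistic annotation provided by spaCy.

```python
import re
from collections import Counter

# Hypothetical two-document corpus, invented for illustration only.
corpus = {
    "doc1": "Natural language processing enables automated analysis of texts.",
    "doc2": "Python packages enable automated analysis of text corpora.",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive regex tokenizer)."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for text in documents.values():
        counts.update(tokenize(text))
    return counts


frequencies = word_frequencies(corpus)

# The three most frequent tokens form a very small statistical summary.
top_words = frequencies.most_common(3)
print(top_words)
```

In practice the naive tokenizer would be replaced by a proper NLP pipeline, and the resulting counts could be passed to a plotting library for visualisation.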

Bibliography:

Altinok, D. (2021). Mastering spaCy: An end-to-end practical guide to implementing NLP applications using the Python ecosystem. Birmingham: Packt Publishing.

Lane, H., Howard, C., Hapke, H. M. (2021). Przetwarzanie języka naturalnego w akcji. Rozumienie, analiza i generowanie tekstu w Pythonie na przykładzie języka angielskiego. Warszawa: PWN.

Mattingly, W. (2022). Introduction to Python for Digital Humanities, URL: www.python-textbook.pythonhumanities.com.

Mattingly, W. (2021). Introduction to spaCy 3, URL: www.spacy.pythonhumanities.com.

Sweigart, A. (2020). Automatyzacja nudnych zadań z Pythonem. Nauka programowania. Gliwice: Helion.

Learning outcomes:

Student

- is familiar with the tools for text data processing and analysis available in the Python language

- knows the basics of programming in Python and Python packages for text data processing and analysis

- knows the most important concepts and techniques of natural language processing

- is able to analyse a text corpus using Python packages

- is able to formulate a hypothesis concerning a text corpus and verify it using natural language processing techniques

- is able to visualise the results of a text corpus analysis

- is able to critically evaluate information on artificial intelligence systems based on text data

- is able to understand the importance of natural language processing in solving both theoretical and practical problems and to apply the methods of this field to achieve their own research goals

Assessment methods and assessment criteria:

Attendance in class (two absences allowed).

Regular completion of programming and natural language processing tasks.

Completion of a small individual or group project using natural language processing methods.

Classes in period "Summer semester 2023/24" (in progress)

Time span: 2024-02-19 - 2024-06-16
Type of class:
Seminar, 30 hours, 15 places
Coordinators: Marcin Będkowski, Iwona Burkacka
Group instructors: Marcin Będkowski
Examination: Course - Grading
Seminar - Grading
Course descriptions are protected by copyright.
Copyright by University of Warsaw.