University of Warsaw

Experimental semantics - corpus analysis module

General data

Course ID: 3501-KOG-SE-MAK
Erasmus code / ISCED: 08.1 / (0223) Philosophy and ethics
Course title: Experimental semantics - corpus analysis module
Name in Polish: Semantyka eksperymentalna - moduł analizy korpusowej
Organizational unit: Institute of Philosophy
Course groups:
ECTS credit allocation (and other scores): (not available)
Language: Polish
Type of course:

elective courses

Mode:

Remote learning

Short description:

The main aim of the module is to introduce students to the available tools for analyzing text corpora. In experimental semantics, corpus data can serve several functions: it can provide evidence for or against a semantic hypothesis, serve as a source of experimental materials, or inspire new hypotheses. Students will learn about the structure of a typical text corpus and the basic assumptions behind corpus linguistics. The course introduces basic association measures (t-score, χ², MI, logDice) and shows how they can be used to test hypotheses about the co-occurrence of particular semantic units in real linguistic material.

Full description:

1. Corpora and search engines

Lesson 1. Available English and Polish corpora

- discussion of the structure of the following corpora: NKJP, BNC and COCA

- text structure of the corpora (balance of types and sources of texts)

- additional information in corpora (metadata, tagsets etc.)

- available search engines

- practical aim: a student can search for words and phrases using different search engines for Polish and English corpora

Lesson 2. Advanced features of search engines. Corpus Query Language

- syntax of CQL

- regular expressions

- searching using metadata

- practical aim: a student can construct a complex search query using regular expressions and metadata
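
The way a CQL query combines per-token attribute constraints with regular expressions can be sketched in Python. This is a toy matcher over a hand-made tagged sentence, not the implementation of any search engine. The attribute names (word, lemma, tag), the Penn Treebank style tags and the function names are assumptions made for illustration. It mimics the CQL query [tag="JJ.*"] [lemma="corpus"], assuming that a pattern has to match the whole attribute value.

```python
import re

# A tiny hand-made "corpus": each token carries word, lemma and tag attributes.
# Tags follow Penn Treebank conventions; real corpora use their own tagsets.
TOKENS = [
    {"word": "Large", "lemma": "large", "tag": "JJ"},
    {"word": "corpora", "lemma": "corpus", "tag": "NNS"},
    {"word": "beat", "lemma": "beat", "tag": "VBP"},
    {"word": "a", "lemma": "a", "tag": "DT"},
    {"word": "smaller", "lemma": "small", "tag": "JJR"},
    {"word": "corpus", "lemma": "corpus", "tag": "NN"},
]


def token_matches(token, constraints):
    """True if every constrained attribute fully matches its regular expression."""
    return all(
        re.fullmatch(pattern, token[attr]) is not None
        for attr, pattern in constraints.items()
    )


def match_sequence(tokens, query):
    """Return the word sequences that satisfy a list of per-token constraints."""
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        if all(token_matches(tok, cons) for tok, cons in zip(window, query)):
            hits.append(" ".join(tok["word"] for tok in window))
    return hits


# Rough equivalent of the CQL query: [tag="JJ.*"] [lemma="corpus"]
QUERY = [{"tag": r"JJ.*"}, {"lemma": "corpus"}]
print(match_sequence(TOKENS, QUERY))  # ['Large corpora', 'smaller corpus']
```

Searching on the lemma rather than the word form is what lets a single query find both "corpora" and "corpus".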

2. Collocations

Lesson 1. Association measures

- t-score

- χ²

- Mutual Information

- logDice

- statistical hypothesis testing using association measures

- practical aim: given word frequency data, a student can calculate and interpret different association measures, understands the practical and theoretical differences among them, and can use them to statistically test hypotheses about the co-occurrence of particular semantic units
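
The four measures listed above can be computed from four numbers: the frequency of each word, the frequency of the pair, and the corpus size. The following Python sketch uses the commonly cited definitions (observed versus expected pair frequency for the t-score and MI, a 2x2 contingency table for χ², and 14 + log2 of the Dice coefficient for logDice). The function names are illustrative, and the counts are those of the "new companies" example discussed by Manning and Schütze (1999).

```python
import math


def expected(f_x, f_y, n):
    """Expected pair frequency if the two words were independent."""
    return f_x * f_y / n


def t_score(f_xy, f_x, f_y, n):
    return (f_xy - expected(f_x, f_y, n)) / math.sqrt(f_xy)


def mutual_information(f_xy, f_x, f_y, n):
    return math.log2(f_xy / expected(f_x, f_y, n))


def log_dice(f_xy, f_x, f_y):
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def chi_squared(f_xy, f_x, f_y, n):
    # Cells of the 2x2 contingency table.
    o11 = f_xy                    # x followed by y
    o12 = f_x - f_xy              # x without y
    o21 = f_y - f_xy              # y without x
    o22 = n - f_x - f_y + f_xy    # neither x nor y
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


# Counts for "new companies": f(new), f(companies), f(new companies), corpus size.
F_X, F_Y, F_XY, N = 15828, 4675, 8, 14307668

t = t_score(F_XY, F_X, F_Y, N)
chi2 = chi_squared(F_XY, F_X, F_Y, N)
print("t-score:", round(t, 4))
print("chi-squared:", round(chi2, 2))
print("MI:", round(mutual_information(F_XY, F_X, F_Y, N), 3))
print("logDice:", round(log_dice(F_XY, F_X, F_Y), 3))

# Critical values: 2.576 for t (alpha = 0.005), 3.841 for chi-squared (df = 1, alpha = 0.05).
print("reject independence (t):", t > 2.576)
print("reject independence (chi-squared):", chi2 > 3.841)
```

For this pair the t-score is about 1.0 and χ² about 1.55, both below their critical values, so the hypothesis that "new" and "companies" co-occur only by chance cannot be rejected.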

Lesson 2. Association measures in corpus search engines

- using corpus search engines to compute association statistics and extract frequency data

- practical aim: a student can use available tools to compute association statistics and, when a given measure is not available, can extract the frequency data and compute the statistic herself
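
When a search engine only exports raw frequencies, the statistic can be computed afterwards. The sketch below assumes a tab-separated export with the columns collocate, cooccurrence and frequency; this layout, the node word and all the counts are invented for illustration and do not come from any real corpus. It ranks the candidates by logDice.

```python
import csv
import io
import math

# Invented frequency export for the node word "strong" (not real corpus data).
NODE_FREQUENCY = 50000
EXPORT = (
    "collocate\tcooccurrence\tfrequency\n"
    "the\t900\t6000000\n"
    "tea\t120\t3000\n"
    "evidence\t300\t20000\n"
)


def log_dice(f_xy, f_x, f_y):
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def rank_by_log_dice(tsv_text, node_frequency):
    """Return (collocate, logDice) pairs, strongest association first."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    scored = [
        (
            row["collocate"],
            log_dice(int(row["cooccurrence"]), node_frequency, int(row["frequency"])),
        )
        for row in reader
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for collocate, score in rank_by_log_dice(EXPORT, NODE_FREQUENCY):
    print(f"{collocate}\t{score:.2f}")
```

In this made-up table "the" has by far the highest raw co-occurrence count but ends up last, because logDice penalizes words that are frequent everywhere.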

3. SketchEngine

Lesson 1. What is SketchEngine and what can it do for you?

- discussion of the corpora available in SketchEngine

- searching and saving results of the queries

- association measures available in SketchEngine

- WordSketches

- parallel corpora

- practical aim: a student can apply her knowledge and skills from previous lessons when working with SketchEngine

4. WordNets

Lesson 1. WordNet and Słowosieć

- structure of WordNet - different semantic relations between semantic units

- using WordNets in combination with corpora

- practical aim: a student can use information from WordNets in her work with text corpora
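
One way to combine a WordNet with a corpus is to expand a query word with its hyponyms before searching. The Python sketch below uses a hand-made miniature hierarchy, which is a simplification and not real WordNet or Słowosieć data, and builds a CQL-style lemma query from it. The function names are illustrative.

```python
# Hand-made miniature hyponym hierarchy; NOT real WordNet or Słowosieć data.
# It is assumed to contain no cycles.
HYPONYMS = {
    "animal": ["dog", "cat"],
    "dog": ["poodle", "terrier"],
    "terrier": ["airedale"],
}


def all_hyponyms(lemma, hierarchy):
    """Collect the direct and indirect hyponyms of a lemma, depth-first."""
    found = []
    for child in hierarchy.get(lemma, []):
        if child not in found:
            found.append(child)
        for descendant in all_hyponyms(child, hierarchy):
            if descendant not in found:
                found.append(descendant)
    return found


def lemma_query(lemma, hierarchy):
    """Build a CQL-style query matching the lemma or any of its hyponyms."""
    lemmas = sorted({lemma, *all_hyponyms(lemma, hierarchy)})
    return '[lemma="' + "|".join(lemmas) + '"]'


print(lemma_query("dog", HYPONYMS))  # [lemma="airedale|dog|poodle|terrier"]
```

The resulting string can be pasted into a search engine that accepts CQL, so that a search for "dog" also retrieves sentences mentioning particular kinds of dogs.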

5. Scripting your work with corpora (for volunteers)

Lesson 1. Access to SketchEngine using Python

- discussion of the SketchEngine API

- short introduction to JSON data format and simplejson Python library

- practical aim: a student can access the full functionality of SketchEngine using Python
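
The two steps of scripted access, building a request and reading the JSON answer, can be sketched as follows. The base URL, the method name and the parameter names are assumptions that should be checked against the current SketchEngine API documentation, and authentication is left out. The sample response is hand-written with invented numbers and a simplified, hypothetical shape, so no network call is made. The standard json module is used here; simplejson offers the same loads function.

```python
import json
from urllib.parse import urlencode

# Assumed base URL; verify it against the current API documentation.
BASE_URL = "https://api.sketchengine.eu/bonito/run.cgi"


def build_request_url(method, params):
    """Compose a request URL from a method name and query parameters."""
    return f"{BASE_URL}/{method}?{urlencode(params)}"


# Method and parameter names below are assumptions for illustration.
url = build_request_url(
    "wsketch",
    {"corpname": "preloaded/bnc2", "lemma": "corpus", "format": "json"},
)
print(url)

# Hand-written stand-in for a response; field names and numbers are invented.
SAMPLE_RESPONSE = """
{"lemma": "corpus",
 "collocates": [
   {"word": "linguistics", "count": 42, "score": 9.1},
   {"word": "parallel", "count": 17, "score": 10.3}
 ]}
"""


def top_collocates(raw_json, limit=1):
    """Parse a JSON string and return the highest-scoring collocates."""
    data = json.loads(raw_json)
    ranked = sorted(data["collocates"], key=lambda c: c["score"], reverse=True)
    return [c["word"] for c in ranked[:limit]]


print(top_collocates(SAMPLE_RESPONSE))
```

In a real script the URL would be fetched with an HTTP client together with the user's API credentials, and the returned text passed to the same parsing step.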

Bibliography:

- Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

- Davies, M. (2007). Semantically-based queries with a joint BNC/WordNet database. Language and Computers, 62(1), 149-167.

- Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39-41.

- Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 235-244.

- Kilgarriff, A., Rychlý, P., Smrž, P., & Tugwell, D. (2004). The Sketch Engine (Technical Report ITRI-04-08). Proceedings of EURALEX 2004, 105-116.

- Lewandowska-Tomaszczyk, B., Bańko, M., Górski, R. L., Pęzik, P., & Przepiórkowski, A. (2012). Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN.

- Maziarz, M., Piasecki, M., Rudnicka, E., & Szpakowicz, S. (2014). plWordNet as the cornerstone of a toolkit of lexico-semantic resources. In Proceedings of the Seventh Global Wordnet Conference (pp. 304-312).

- https://www.sketchengine.co.uk/user-guide/user-manual/word-sketch/

Learning outcomes:

Knowledge acquired:

- the student knows the corpus resources available online

- the student knows different types of corpora and the purposes for which they can be used

- the student knows the basic concepts and terminology of corpus linguistics

- the student knows selected corpus tools and how to apply them

Skills acquired:

- the student can use selected tools for corpus analysis

- the student can analyze results obtained from corpus data

- the student can work in a selected environment dedicated to corpora

Social competences acquired:

- the student can collaborate in a research team using digital communication tools

Assessment methods and assessment criteria:

There will be an assignment for each week (7 assignments in total, with a maximum of 10 points each). The final mark depends solely on successfully completing the assignments.

0-35 points: grade 2

35-50 points: grade 3

50-60 points: grade 4

60-70 points: grade 5

This course is not currently offered.
Course descriptions are protected by copyright.
Copyright by University of Warsaw.