The Macedonian Spoken Corpus has been developed as part of the SNF-project ‘Ill-bred sons’, family and friends: tracing the multiple affiliations of Balkan Slavic at the Department of Slavonic Literatures and Linguistics, University of Zurich. The open-access corpus is designed for anyone with an interest in studying the Macedonian language and its dialects.
The corpus comprises transcriptions of audio files collected during a series of field research trips to the Prespa, Bitola and Debar regions in 2012, 2014, 2016 and 2019. Most texts were gathered in semi-directed interviews based on a questionnaire about local traditions (weddings, celebration of calendar holidays etc.), local mythology and folklore, as well as the informants’ biographical stories.
In addition, the corpus contains several texts published earlier by Macedonian dialectologists (Popvasileva subcorpus) and the ‘Bombi’ subcorpus representing the modern urban variety of Skopje.
The informants belong to different religious communities (Islam, Orthodox Christianity) and a number of them are bilingual (Aromanian-Macedonian or Albanian-Macedonian). The speech of some informants can be characterized as dialectal, while others use a regional standard variety, i.e. predominantly standard Macedonian with individual dialectal features. The informants are anonymized.
The corpus is lemmatized and annotated for part-of-speech with minimal morphological information (in pronouns, nouns and verbs).
Anastasia Escher: software, field research, transcription, annotation.
Olivier Winistörfer: field research, OCR, annotation.
Giulia Morra: OCR.
ID | Year | Dialect | Birth year | Place | Tokens | Themes | Religion | L1 |
T1 | 2016 | Prespa | 1956 | Resen | 2848 | childhood | Orthodox | Mac |
T2 | 2016 | Prespa | 1956 | Resen | 452 | biography | Orthodox | Mac |
T3 | 2012 | Reka | 1948 | Janche | 1030 | vampires | Islam | Mac |
T4 | 2012 | Reka | 1948 | Janche | 1402 | wedding | Islam | Mac |
T5 | 2012 | Reka | 1948 | Janche | 1537 | wedding, biography | Islam | Mac |
T6 | 2012 | Reka | 1948 | Janche | 1121 | wedding, holidays | Islam | Mac |
T7 | 2012 | Reka | 1948 | Janche | 1461 | family | Islam | Mac |
T8 | 2016 | Prespa | 1963 | Krani | 1083 | fairy tale | Islam | Alb |
T9 | 2014 | Prespa | 1956 | Resen | 889 | proverbs | Orthodox | Mac |
T10 | 2014 | Prespa | 1956 | Resen | 855 | proverbs, sayings | Orthodox | Mac |
T11 | 2014 | Prespa | 1956 | Resen | 733 | biography | Orthodox | Mac |
T12 | 2014 | Prespa | 1956 | Resen | 677 | biography | Orthodox | Mac |
T13 | 2014 | Prespa | 1939 | Krani | 370 | vampires | Islam | Alb |
T14 | 2014 | Prespa | 1949 | Arvati | 1578 | biography | Orthodox | Mac |
T15 | 2012 | Reka | 1948 | Janche | 1632 | family | Islam | Mac |
T16 | 2012 | Reka | 1948 | Janche | 1250 | wedding | Islam | Mac |
T17 | 2013 | Reka | 1949 | Janche | 1077 | wedding | Islam | Mac |
T18 | 2014 | Prespa | 1950 | Resen | 290 | biography | Orthodox | Arom |
T19 | 2014 | Prespa | 1950 | Resen | 114 | biography | Orthodox | Arom |
T20 | 2015 | Prespa | 1950 | Resen | 178 | biography | Orthodox | Arom |
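The metadata table above can be read programmatically, for instance to total token counts per dialect or per first language. The following is a minimal sketch, assuming the rows are available as pipe-delimited text exactly as shown; the helper names `parse_table` and `tokens_by` are hypothetical and not part of the corpus software. Only the first three rows are used for illustration.

```python
from collections import defaultdict

# Sample rows copied from the metadata table (first three texts only).
ROWS = """\
ID | Year | Dialect | Birth year | Place | Tokens | Themes | Religion | L1 |
T1 | 2016 | Prespa | 1956 | Resen | 2848 | childhood | Orthodox | Mac |
T2 | 2016 | Prespa | 1956 | Resen | 452 | biography | Orthodox | Mac |
T3 | 2012 | Reka | 1948 | Janche | 1030 | vampires | Islam | Mac |
"""


def split_row(line):
    """Split one pipe-delimited row into stripped cells, ignoring the trailing pipe."""
    return [cell.strip() for cell in line.strip().rstrip("|").split("|")]


def parse_table(text):
    """Return one dict per data row, keyed by the header cells."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = split_row(lines[0])
    return [dict(zip(header, split_row(ln))) for ln in lines[1:]]


def tokens_by(records, key):
    """Sum the 'Tokens' column grouped by the given metadata column."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[key]] += int(rec["Tokens"])
    return dict(totals)


records = parse_table(ROWS)
per_dialect = tokens_by(records, "Dialect")
```

Grouping by another column, such as `"Religion"` or `"L1"`, only requires changing the second argument of `tokens_by`.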
In addition to the data collected during field research trips to Western Macedonia, the Macedonian Spoken Corpus also includes data from the variety spoken in Skopje (the capital of North Macedonia) which may be characterized as a supra-dialectal standard.
While it is based on the standard language, it has also accumulated traits from a diversity of regional varieties as a result of the continuous migration to the capital.
It also contains a number of Serbo-Croatian features, mostly in the lexical domain, since Serbo-Croatian was dominant in Macedonia for a long time and was taught in schools as an obligatory subject. In the Macedonian Spoken Corpus, this variety is represented by the Bombi subcorpus.
The Bombi subcorpus contains transcripts of wiretapped conversations of the members of the Macedonian political elite connected with the then-ruling party VMRO-DPMNE.
Victor Friedman describes these transcriptions as follows:
Like the dialectal transcriptions, the Bombi subcorpus is available in XML format and on the search page.
The Popvasileva subcorpus contains texts collected by Alexandra Popvasileva for her PhD dissertation «Bilingual storytelling of folk fairy tales» (Mac. «Двојазичното раскажување на народни приказни (влашко-македонски и македонско-влашки релации)»), which she defended in Ljubljana in 1983 and published in Skopje in 1987.
The author of the dissertation analyzes the complex sociolinguistic situation in the Macedonian town of Krushevo, one of the most significant Aromanian centers in the country, based on a collection of bilingual storytelling.
The informants were native speakers of Aromanian and most of them were born at the beginning of the 20th century. They were asked to tell folk tales both in Aromanian and in Macedonian, their second language (L2).
These texts were published in the dissertation and are an invaluable source for research in both cultural anthropology and language contact. Like the other dialect transcriptions in the Macedonian Spoken Corpus, the tales from the Popvasileva subcorpus are available both in the search database and in TEI XML format:
The corpus is annotated for part-of-speech (PoS), lemma and minimal morphological information (only the noun gender and verbal aspect are marked).
The PoS-annotation is based, with minor changes, on the MULTEXT-East morphosyntactic specifications which define categories and their morphosyntactic features.
Only personal pronouns are marked with a complete tag marking all of their morphological categories:
The morphological annotation of the tokens has been simplified in the following way:
PoS tag | Full tag | Meaning | Example |
N | Nm | N-noun, m-masculine | човек 'human, man', град 'town' |
N | Nf | N-noun, f-feminine | девојка 'girl', вода 'water' |
N | Np | N-noun, p-proper | Италија 'Italy', Оливер 'Olivier' |
A | A | A-adjective | убав 'beautiful', ладен 'cold' |
P | Pp1-sn | P-pronoun, p-personal, 1-first person, (no gender), s-singular, n-nominative | јас 'I' |
P | Pp1-sd | P-pronoun, p-personal, 1-first person, (no gender), s-singular, d-dative | ми '(to) me' |
P | Pp1-sa | P-pronoun, p-personal, 1-first person, (no gender), s-singular, a-accusative | ме, мене 'me' |
P | Pp2-sn | P-pronoun, p-personal, 2-second person, (no gender), s-singular, n-nominative | ти 'you' |
P | Pp2-sd | P-pronoun, p-personal, 2-second person, (no gender), s-singular, d-dative | ти '(to) you' |
P | Pp2-sa | P-pronoun, p-personal, 2-second person, (no gender), s-singular, a-accusative | те, тебе 'you (direct object)' |
P | Pp3msn | P-pronoun, p-personal, 3-third person, m-masculine, s-singular, n-nominative | тој 'he' |
P | Pp3msd | P-pronoun, p-personal, 3-third person, m-masculine, s-singular, d-dative | му, нему '(to) him' |
P | Pp3msa | P-pronoun, p-personal, 3-third person, m-masculine, s-singular, a-accusative | го, него 'him' |
P | Pp3fsn | P-pronoun, p-personal, 3-third person, f-feminine, s-singular, n-nominative | таа 'she' |
P | Pp3fsd | P-pronoun, p-personal, 3-third person, f-feminine, s-singular, d-dative | ѝ, нејзе '(to) her' |
P | Pp3fsa | P-pronoun, p-personal, 3-third person, f-feminine, s-singular, a-accusative | ја, неа 'her' |
P | Pp3nsn | P-pronoun, p-personal, 3-third person, n-neuter, s-singular, n-nominative | тоа 'it' |
P | Pp1-pn | P-pronoun, p-personal, 1-first person, (no gender), p-plural, n-nominative | ние 'we' |
P | Pp1-pd | P-pronoun, p-personal, 1-first person, (no gender), p-plural, d-dative | нам '(to) us' |
P | Pp1-pa | P-pronoun, p-personal, 1-first person, (no gender), p-plural, a-accusative | нас 'us' |
P | Pp2-pn | P-pronoun, p-personal, 2-second person, (no gender), p-plural, n-nominative | вие 'you (pl.)' |
P | Pp2-pd | P-pronoun, p-personal, 2-second person, (no gender), p-plural, d-dative | вам '(to) you (pl.)' |
P | Pp2-pa | P-pronoun, p-personal, 2-second person, (no gender), p-plural, a-accusative | вас 'you (pl., object form)' |
V | Vi | V-verb, i-imperfective | чека 'wait' |
V | Vp | V-verb, p-perfective | рече 'say' |
V | Vb | V-verb, b-biaspectual | анализира 'analyse' |
R | R | R-adverb | многу 'much' |
S | S | S-adposition | за 'for', на 'on' |
C | C | C-conjunction | дека 'that', кога 'when' |
M | M | M-numeral | два 'two', трети 'third' |
J | J | J-interjection | о!, е! |
Z | Z | Z-punctuation | .,!?:, |
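The tag scheme in the table above is positional, so a tag can be decoded character by character. The following is a minimal sketch of such a decoder, covering only the values listed in the table; the function name `decode_tag` and the output keys are hypothetical and are not part of the corpus tooling. Sub-tags that do not appear in the table are passed through unchanged rather than guessed.

```python
# Lookup tables built from the annotation table; nothing beyond it is assumed.
POS = {
    "N": "noun", "A": "adjective", "P": "pronoun", "V": "verb",
    "R": "adverb", "S": "adposition", "C": "conjunction",
    "M": "numeral", "J": "interjection", "Z": "punctuation",
}
NOUN_CLASS = {"m": "masculine", "f": "feminine", "p": "proper"}
ASPECT = {"i": "imperfective", "p": "perfective", "b": "biaspectual"}
GENDER = {"m": "masculine", "f": "feminine", "n": "neuter", "-": None}
NUMBER = {"s": "singular", "p": "plural"}
CASE = {"n": "nominative", "d": "dative", "a": "accusative"}


def decode_tag(tag):
    """Turn a full tag such as 'Pp3msn' or 'Vi' into a dict of features."""
    if not tag or tag[0] not in POS:
        raise ValueError(f"unknown tag: {tag!r}")
    pos, rest = tag[0], tag[1:]
    info = {"pos": POS[pos]}
    if pos == "N" and rest:
        # Second position: gender, or 'p' for proper nouns.
        info["noun_class"] = NOUN_CLASS.get(rest, rest)
    elif pos == "V" and rest:
        # Second position: verbal aspect.
        info["aspect"] = ASPECT.get(rest, rest)
    elif pos == "P" and len(rest) == 5 and rest[0] == "p":
        # Personal pronouns carry a complete tag: type, person, gender, number, case.
        _, person, gender, number, case = rest
        info.update(
            type="personal",
            person=int(person),
            gender=GENDER.get(gender, gender),
            number=NUMBER.get(number, number),
            case=CASE.get(case, case),
        )
    return info
```

For example, `decode_tag("Pp1-sd")` reports a first person singular dative pronoun with `gender` set to `None`, because the hyphen in the fourth position marks the absence of gender.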
The Macedonian Spoken Corpus is being created as part of the SNF project ‘Ill-bred sons’, family and friends: tracing the multiple affiliations of Balkan Slavic at the Department of Slavonic languages and literatures of the University of Zurich. The corpus is available to anyone interested in studying the Macedonian language and its dialects. If corpus data are used in a publication, please provide a reference to the online resource. The collection of annotated texts is constantly being expanded.
Since the corpus is still under development, the annotation of the transcripts may contain errors. We would be grateful for any comments or suggestions for improving the linguistic content of the resource.
Escher, Anastasia; Winistörfer, Olivier (eds., 2021). Macedonian Spoken Corpus. Zürich: UZH Institute of Slavic Studies. Available online at (last access: )
The photos in the home page header are taken from Pixabay, a repository of licence-free images.