Corpus content

The Macedonian Spoken Corpus has been developed as part of the SNF-project ‘Ill-bred sons’, family and friends: tracing the multiple affiliations of Balkan Slavic at the Department of Slavonic Literatures and Linguistics, University of Zurich. The open-access corpus is designed for anyone with an interest in studying the Macedonian language and its dialects.

The corpus comprises transcriptions of audio files collected during a series of field research trips to the Prespa, Bitola and Debar regions in 2012, 2014, 2016 and 2019. Most texts were gathered in semi-directed interviews based on a questionnaire about local traditions (weddings, celebration of calendar holidays etc.), local mythology and folklore, as well as the informants’ biographical stories.

In addition, the corpus contains several texts published earlier by Macedonian dialectologists (Popvasileva subcorpus) and the 'Bombi' subcorpus, which represents the modern urban variety of Skopje.

The informants belong to different religious communities (Islam, Orthodox Christianity), and a number of them are bilingual (Aromanian-Macedonian or Albanian-Macedonian). The speech of some informants can be characterized as dialectal, while others use a regional standard variety, i.e. predominantly standard Macedonian with individual dialectal features. The informants are anonymized.

The corpus is lemmatized and annotated for part-of-speech with minimal morphological information (in pronouns, nouns and verbs).


Anastasia Escher: software, field research, transcription, annotation.

Olivier Winistörfer: field research, OCR, annotation.

Giulia Morra: OCR.


1. Subcorpus of texts in Western Macedonian dialects (field data)

ID Year Dialect Birth year Place Tokens Themes Religion L1
T1 2016 Prespa 1956 Resen 2848 childhood Orthodox Mac
T2 2016 Prespa 1956 Resen 452 biography Orthodox Mac
T3 2012 Reka 1948 Janche 1030 vampires Islam Mac
T4 2012 Reka 1948 Janche 1402 wedding Islam Mac
T5 2012 Reka 1948 Janche 1537 wedding Islam Mac
T6 2012 Reka 1948 Janche 1121 wedding Islam Mac
T7 2012 Reka 1948 Janche 1461 family Islam Mac
T8 2016 Prespa 1963 Krani 1083 fairy tale Islam Alb
T9 2014 Prespa 1956 Resen 889 proverbs Orthodox Mac
T10 2014 Prespa 1956 Resen 855 proverbs Orthodox Mac
T11 2014 Prespa 1956 Resen 733 biography Orthodox Mac
T12 2014 Prespa 1956 Resen 677 biography Orthodox Mac
T13 2014 Prespa 1939 Krani 370 vampires Islam Alb
T14 2014 Prespa 1949 Arvati 1578 biography Orthodox Mac
T15 2012 Reka 1948 Janche 1632 family Islam Mac
T16 2012 Reka 1948 Janche 1250 wedding Islam Mac
T17 2013 Reka 1949 Janche 1077 wedding Islam Mac
T18 2014 Prespa 1950 Resen 290 biography Orthodox Arom
T19 2014 Prespa 1950 Resen 114 biography Orthodox Arom
T20 2015 Prespa 1950 Resen 178 biography Orthodox Arom
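The token counts in the table can be summed per dialect to get a sense of how the field data are distributed. The following is a minimal sketch in Python; the name `tokens_per_dialect` and the tuple layout are illustrative assumptions, not part of the corpus software, and the figures are copied by hand from the table above.

```python
from collections import defaultdict

# (text ID, dialect, token count), copied from the table above.
TEXTS = [
    ("T1", "Prespa", 2848), ("T2", "Prespa", 452),
    ("T3", "Reka", 1030), ("T4", "Reka", 1402),
    ("T5", "Reka", 1537), ("T6", "Reka", 1121),
    ("T7", "Reka", 1461), ("T8", "Prespa", 1083),
    ("T9", "Prespa", 889), ("T10", "Prespa", 855),
    ("T11", "Prespa", 733), ("T12", "Prespa", 677),
    ("T13", "Prespa", 370), ("T14", "Prespa", 1578),
    ("T15", "Reka", 1632), ("T16", "Reka", 1250),
    ("T17", "Reka", 1077), ("T18", "Prespa", 290),
    ("T19", "Prespa", 114), ("T20", "Prespa", 178),
]


def tokens_per_dialect(texts):
    """Return a dict mapping each dialect to its total token count."""
    totals = defaultdict(int)
    for _text_id, dialect, tokens in texts:
        totals[dialect] += tokens
    return dict(totals)


if __name__ == "__main__":
    for dialect, total in sorted(tokens_per_dialect(TEXTS).items()):
        print(f"{dialect}: {total} tokens")
```

The same grouping could be applied to any other column of the table (religion, first language, theme) by adding that column to the tuples.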

2. 'Bombi' subcorpus

In addition to the data collected during field research trips to Western Macedonia, the Macedonian Spoken Corpus also includes data from the variety spoken in Skopje (the capital of North Macedonia) which may be characterized as a supra-dialectal standard.

While it is based on the standard language, it has also accumulated traits from a diversity of regional varieties as a result of the continuous migration to the capital.

It also contains a number of Serbo-Croatian features, mostly in the lexical domain, since Serbo-Croatian was dominant in Macedonia for a long time and was taught in schools as an obligatory subject. In the Macedonian Spoken Corpus, this variety is represented by the Bombi subcorpus.

The Bombi subcorpus contains transcripts of wiretapped conversations of the members of the Macedonian political elite connected with the then-ruling party VMRO-DPMNE.

Victor Friedman describes these transcriptions as follows:

In 2008, Kosovo became independent, Greece blocked Macedonia’s accession to NATO, and a coalition of right-wing nationalist Macedonian and Albanian political parties (VMRO-DPMNE [henceforth simply VMRO] and DUI/BDI, respectively) subsequently took control of the government and the state.

From 2008–2015, the Macedonian government, headed by Prime Minister and VMRO party leader Nikola Gruevski, illegally wire-tapped the telephones of 20,000 citizens, including everyone in the government itself except the Prime Minister’s direct line to his first cousin, who was head of the Administration for Security and Counter-Intelligence (UBK), i.e. the secret police.

In 2015, the main opposition party, SDSM, led by Zoran Zaev, obtained the sound files and published selections of conversations held by members of the Prime Minister’s government in a series of press releases entitled Vistinata za Makedonija ‘The Truth about Macedonia’ but referred to in the press as the Bombi ‘Bombs’ (Bombi 2015). [...] the transcripts provide a fascinating insight into modern colloquial Macedonian as used by educated elites in Skopje today.

Like the dialectal transcriptions, the Bombi subcorpus is available in XML format and on the search page.

3. Popvasileva subcorpus

The Popvasileva subcorpus contains texts collected by Alexandra Popvasileva for her PhD dissertation «Bilingual storytelling of folk fairy tales» (Mac. «Двојазичното раскажување на народни приказни (влашко-македонски и македонско-влашки релации)»), which she defended in Ljubljana in 1983 and published in Skopje in 1987.

The author of the dissertation analyzes the complex sociolinguistic situation in the Macedonian town of Krushevo, one of the most significant Aromanian centers in the country, based on a collection of bilingual storytelling.

The informants were native speakers of Aromanian and most of them were born at the beginning of the 20th century. They were asked to tell folk tales both in Aromanian and in Macedonian, their second language (L2).

These texts were published in the dissertation and are an invaluable source for research in both cultural anthropology and language contact. Like the other dialect transcriptions in the Macedonian Spoken Corpus, the tales from the Popvasileva subcorpus are available both in the search database and in TEI XML format:

  1. Куќата од сол и куќата од керамиди (Text ID: PV1).
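For readers who want to work with the XML files directly, the following Python sketch shows one way to pull tokens, lemmas and PoS tags out of a TEI document. It is only an illustration under stated assumptions: the corpus's actual markup is not described on this page, so the sketch assumes that tokens are encoded as TEI `<w>` elements carrying `lemma` and `pos` attributes. The sample document and the function name `read_tokens` are made up for the example and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample, NOT taken from the corpus. It assumes that each
# token is a <w> element with "lemma" and "pos" attributes.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <s>
        <w lemma="човек" pos="Nm">човек</w>
        <w lemma="чека" pos="Vi">чека</w>
      </s>
    </body>
  </text>
</TEI>"""


def read_tokens(xml_text):
    """Return (form, lemma, pos) tuples for every <w> element."""
    root = ET.fromstring(xml_text)
    return [
        (w.text, w.get("lemma"), w.get("pos"))
        for w in root.iterfind(".//tei:w", TEI_NS)
    ]


if __name__ == "__main__":
    for form, lemma, pos in read_tokens(SAMPLE):
        print(form, lemma, pos)
```

If the downloaded files use different element or attribute names, only the search path and the two `get` calls need to be adapted.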

Tag set


The corpus is annotated for part-of-speech (PoS), lemma and minimal morphological information (only the noun gender and verbal aspect are marked).

The PoS-annotation is based, with minor changes, on the MULTEXT-East morphosyntactic specifications which define categories and their morphosyntactic features.

Only personal pronouns receive a complete tag that marks all of their morphological categories.

Grammatical conventions

The morphological annotation of the tokens has been simplified in the following way:

  • Passive participles are always annotated as verbs (’Vi’ or ’Vp’), as opposed to the solution applied in MULTEXT-East morphosyntactic specifications.
  • No distinction is made between clitic and full pronouns.
  • The polysemic lexeme da is always annotated as a conjunction (’C’).
  • The го and му pronoun forms are always annotated as masculine, although they can refer to neuter referents.
  • The preposition на, which also functions as an indirect object marker, is always annotated as an adposition.

Tag set

  1. Noun
  2. Adjective
  3. Pronoun
  4. Verb
  5. Adverb
  6. Adposition
  7. Conjunction
  8. Numeral
  9. Interjection
  10. Punctuation
PoS tag Full tag Meaning Example
N Nm N-noun, m-masculine човек 'human, man', град 'town'
N Nf N-noun, f-feminine девојка 'girl', вода 'water'
N Np N-noun, p-proper Италија 'Italy', Оливер 'Olivier'
A A A-adjective убав 'beautiful', ладен 'cold'
P Pp1-sn P-pronoun, p-personal, 1-first person, (no gender), s-singular, n-nominative јас 'I'
P Pp1-sd P-pronoun, p-personal, 1-first person, (no gender), s-singular, d-dative ми '(to) me'
P Pp1-sa P-pronoun, p-personal, 1-first person, (no gender), s-singular, a-accusative ме, мене 'me'
P Pp2-sn P-pronoun, p-personal, 2-second person, (no gender), s-singular, n-nominative ти 'you'
P Pp2-sd P-pronoun, p-personal, 2-second person, (no gender), s-singular, d-dative ти '(to) you'
P Pp2-sa P-pronoun, p-personal, 2-second person, (no gender), s-singular, a-accusative те, тебе 'you (direct object)'
P Pp3msn P-pronoun, p-personal, 3-third person, m-masculine, s-singular, n-nominative тој 'he'
P Pp3msd P-pronoun, p-personal, 3-third person, m-masculine, s-singular, d-dative му, нему '(to) him'
P Pp3msa P-pronoun, p-personal, 3-third person, m-masculine, s-singular, a-accusative го, него 'him'
P Pp3fsn P-pronoun, p-personal, 3-third person, f-feminine, s-singular, n-nominative таа 'she'
P Pp3fsd P-pronoun, p-personal, 3-third person, f-feminine, s-singular, d-dative ѝ, нејзе '(to) her'
P Pp3fsa P-pronoun, p-personal, 3-third person, f-feminine, s-singular, a-accusative ја, неа 'her'
P Pp3nsn P-pronoun, p-personal, 3-third person, n-neuter, s-singular, n-nominative тоа 'it'
P Pp1-pn P-pronoun, p-personal, 1-first person, (no gender), p-plural, n-nominative ние 'we'
P Pp1-pd P-pronoun, p-personal, 1-first person, (no gender), p-plural, d-dative нам '(to) us'
P Pp1-pa P-pronoun, p-personal, 1-first person, (no gender), p-plural, a-accusative нас 'us'
P Pp2-pn P-pronoun, p-personal, 2-second person, (no gender), p-plural, n-nominative вие 'you (pl.)'
P Pp2-pd P-pronoun, p-personal, 2-second person, (no gender), p-plural, d-dative вам '(to) you (pl.)'
P Pp2-pa P-pronoun, p-personal, 2-second person, (no gender), p-plural, a-accusative вас 'you (pl., object form)'
V Vi V-verb, i-imperfective чека 'wait'
V Vp V-verb, p-perfective рече 'say'
V Vb V-verb, b-biaspectual анализира 'analyse'
R R R-adverb многу 'much'
S S S-adposition за 'for', на 'on'
C C C-conjunction дека 'that', кога 'when'
M M M-numeral два 'two', трети 'third'
J J J-interjection о!, е!
Z Z Z-punctuation .,!?:,
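The full tags in the table are positional: the first character gives the part of speech, and the following characters encode the features listed in the "Meaning" column. The Python sketch below decodes a tag into named features. It covers only the values shown in the table; the function name `decode_tag` and the feature names in the returned dictionary are illustrative choices, not part of the corpus software.

```python
# Value tables taken from the "Meaning" column of the tag table.
POS = {
    "N": "noun", "A": "adjective", "P": "pronoun", "V": "verb",
    "R": "adverb", "S": "adposition", "C": "conjunction",
    "M": "numeral", "J": "interjection", "Z": "punctuation",
}
NOUN_SUBTYPE = {"m": "masculine", "f": "feminine", "p": "proper"}
ASPECT = {"i": "imperfective", "p": "perfective", "b": "biaspectual"}
PERSON = {"1": "first", "2": "second", "3": "third"}
GENDER = {"m": "masculine", "f": "feminine", "n": "neuter", "-": None}
NUMBER = {"s": "singular", "p": "plural"}
CASE = {"n": "nominative", "d": "dative", "a": "accusative"}


def _lookup(table, code, what):
    """Look up a code, raising ValueError for values not in the table."""
    try:
        return table[code]
    except KeyError:
        raise ValueError(f"unknown {what} code {code!r}") from None


def decode_tag(tag):
    """Decode a full tag such as 'Pp3msd' into a feature dictionary."""
    if not tag:
        raise ValueError("empty tag")
    pos, rest = tag[0], tag[1:]
    features = {"pos": _lookup(POS, pos, "part-of-speech")}
    if not rest:
        return features
    if pos == "N":
        features["subtype"] = _lookup(NOUN_SUBTYPE, rest, "noun")
    elif pos == "V":
        features["aspect"] = _lookup(ASPECT, rest, "aspect")
    elif pos == "P":
        # Personal pronouns: type, person, gender, number, case.
        if len(rest) != 5 or rest[0] != "p":
            raise ValueError(f"unsupported pronoun tag {tag!r}")
        _, person, gender, number, case = rest
        features.update(
            type="personal",
            person=_lookup(PERSON, person, "person"),
            gender=_lookup(GENDER, gender, "gender"),
            number=_lookup(NUMBER, number, "number"),
            case=_lookup(CASE, case, "case"),
        )
    else:
        raise ValueError(f"unexpected features in tag {tag!r}")
    return features


if __name__ == "__main__":
    for example in ("Nf", "Vp", "Pp3msd", "Pp1-sn", "C"):
        print(example, decode_tag(example))
```

Tags that use codes not listed in the table raise a `ValueError`, which makes it easy to spot annotation values the sketch does not yet cover.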

The Macedonian Spoken Corpus is being created as part of the SNF-project ‘Ill-bred sons’, family and friends: tracing the multiple affiliations of Balkan Slavic at the Department of Slavonic languages and literatures of the University of Zurich. The corpus is available to everyone interested in studying the Macedonian language and its dialects. If corpus data are used in a publication, please cite the online resource. The collection of annotated texts is continually being expanded.

Report an error

Since the corpus is still under development, the annotation of the transcripts may contain errors. We would be grateful for any comments or suggestions for improving the linguistic content of the resource.


Recommended citation

Escher, Anastasia; Winistörfer, Olivier (eds., 2021). Macedonian Spoken Corpus. Zürich: UZH Institute of Slavic Studies. Available online at escher.pythonanywhere.com (last access: )


The photos in the home page header are taken from Pixabay, a repository of licence-free pictures.