Macedonian Spoken Corpus

/Simple tagger

The corpus tagger is available for use as a Python package.

You can use it locally or annotate your Macedonian text online (please first read the information on dealing with homonyms and unknown words).

The tagger is quite simple and doesn't apply any ML algorithms. It is meant for smaller projects with dialect data, and its output requires manual correction.

To use the package locally, install it via pip:

pip install spoken_macedonian_annotation

Use the package either from Python code, as in this example:

from spoken_macedonian_annotation.annotate import MacAnnotator

text = 'Ова е мојата куќа.'
annotator = MacAnnotator(print_to_txt_file=True, mark_homonyms=False, mark_unknown_tokens=False)
result = annotator.annotate(text)
print(result)
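
For more than one utterance, the same call can be wrapped in a small helper. The following is only a sketch: annotate_many and the Annotator protocol are hypothetical names that are not part of the package, and the only assumption made about MacAnnotator is that annotate() accepts a string, as in the example above. The type of the returned value is not documented here, so the helper stores it unchanged.

```python
from typing import Any, Dict, Iterable, Protocol


class Annotator(Protocol):
    """Any object with an annotate(text) method, e.g. a MacAnnotator instance."""

    def annotate(self, text: str) -> Any: ...


def annotate_many(annotator: Annotator, texts: Iterable[str]) -> Dict[str, Any]:
    """Annotate each distinct, non-empty text once and map it to its result."""
    results: Dict[str, Any] = {}
    for text in texts:
        cleaned = text.strip()
        if not cleaned or cleaned in results:
            continue  # skip blank lines and repeated utterances
        results[cleaned] = annotator.annotate(cleaned)
    return results


# Possible usage (requires the package to be installed):
# from spoken_macedonian_annotation.annotate import MacAnnotator
# annotator = MacAnnotator(print_to_txt_file=False, mark_homonyms=True, mark_unknown_tokens=True)
# results = annotate_many(annotator, ['Ова е мојата куќа.'])
```

Creating the annotator once and reusing it avoids repeating any setup work for every utterance.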

Or in a terminal with this command:

annotateMac -i your_text_to_annotate.txt --print_to_txt --mark_homonyms --mark_unknown
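
To annotate every text file in a folder, the command above can be called in a loop. This is a rough sketch, not part of the package: build_command and annotate_folder are hypothetical helpers, and it assumes that the three options shown above are independent switches that can be left out. Where the output file is written is not stated here, so check that before running the loop twice on the same folder.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(path: Path, mark_homonyms: bool = True, mark_unknown: bool = True) -> List[str]:
    """Build the annotateMac call for one input file, using the flags shown above."""
    cmd = ["annotateMac", "-i", str(path), "--print_to_txt"]
    if mark_homonyms:
        cmd.append("--mark_homonyms")
    if mark_unknown:
        cmd.append("--mark_unknown")
    return cmd


def annotate_folder(folder: str) -> None:
    """Run the tagger on every .txt file in the folder, stopping at the first failure."""
    for path in sorted(Path(folder).glob("*.txt")):
        subprocess.run(build_command(path), check=True)
```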

Note that you might need to download the NLTK data package before using the tagger:

import nltk
nltk.download('punkt')
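
To avoid downloading the data on every run, the call can be guarded with a lookup first. A minimal sketch: ensure_nltk_data is a hypothetical helper, and the nltk_module parameter exists only so that the function can be tried out without NLTK installed. It relies on nltk.data.find raising LookupError when a resource is missing.

```python
def ensure_nltk_data(resource="tokenizers/punkt", package="punkt", nltk_module=None):
    """Download an NLTK data package only if it is not already available.

    Returns True if a download was triggered, False if the data was found.
    """
    if nltk_module is None:
        import nltk as nltk_module  # imported lazily so the helper itself has no hard dependency
    try:
        nltk_module.data.find(resource)
        return False
    except LookupError:
        nltk_module.download(package, quiet=True)
        return True
```

Calling ensure_nltk_data() once at the start of a script should be enough.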
