Using Word2vec technique to determine semantic and morphologic similarity in embedded words of the Ukrainian language

Savytska L. V.; Vnukova N. M.; Bezugla I. V.; Pyvovarov V.; Sübay M. T.

Please use this identifier to cite or link to this item: https://repository.hneu.edu.ua/handle/123456789/26122

Title:	Using Word2vec technique to determine semantic and morphologic similarity in embedded words of the Ukrainian language
Authors:	Savytska L. V.; Vnukova N. M.; Bezugla I. V.; Pyvovarov V.; Sübay M. T.
Keywords:	word2vec; NLP; cosine similarity; semantic relations; morphological (linguistics) relations; word vectors; word embedding; Ukrainian language
Date of publication:	2021
Bibliographic description:	Savytska L. V. Using Word2vec technique to determine semantic and morphologic similarity in embedded words of the Ukrainian language / L. V. Savytska, N. M. Vnukova, I. V. Bezugla et al. ‒ CEUR Workshop Proceedings, 2021. ‒ P. 235–248.
Abstract:	The study addresses the mapping of words to vectors of real numbers (word embeddings), one of the most important topics in natural language processing. Word2vec is a recent technique developed by Tomas Mikolov for learning high-quality word vectors. Most studies on clustering word vectors have been carried out for English. Dmitry Chaplinsky has already computed and published vectors for the Ukrainian language using the LexVec, Word2vec and GloVe techniques, obtained from fiction, newswire and ubercorpus texts, for the VESUM dictionary and other related NLP tools for the Ukrainian language. No research had yet been done on vectors produced with the Word2vec technique from a Ukrainian corpus built with a Wikipedia dump as the main source. The corpus contains more than 261 million words, and the vocabulary of unique words obtained from it exceeds 709 thousand. Research using the Word2vec machine learning technique is of great practical importance for computerising many areas of linguistic analysis. The open-source Python programming language was used to obtain word vectors with Word2vec and to calculate the cosine similarity of the vectors. For training Word2vec models in Python, the open-source library Gensim was used, and the cosine similarities of the obtained vectors were also calculated with Gensim. The clustering of the word vectors obtained from the Ukrainian corpus was examined from the standpoint of two sub-branches of linguistics: semantics and morphology. First, the study investigated how accurate the vectors obtained from the Ukrainian corpus are and how well the words represent the cluster they belong to. Second, it investigated how word vectors are clustered and associated with respect to the morphological features of Ukrainian suffixes.
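The cosine similarity mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are hypothetical and were made up for illustration, whereas real Word2vec vectors typically have a hundred or more dimensions and are learned from a corpus. The example uses only the Python standard library.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors, invented for illustration only;
# they are NOT taken from the paper's trained model.
toy_vectors = {
    "київ": [0.9, 0.1, 0.3],
    "львів": [0.8, 0.2, 0.4],
    "яблуко": [0.1, 0.9, 0.2],
}

if __name__ == "__main__":
    for neighbour, score in most_similar("київ", toy_vectors):
        print(f"{neighbour}: {score:.3f}")
```

With a trained Gensim model, the same measure is available through the `similarity` and `most_similar` methods of the model's `KeyedVectors` object, so a hand-written function like this one is needed only to make the arithmetic explicit.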
URI (Uniform Resource Identifier):	http://repository.hneu.edu.ua/handle/123456789/26122
Appears in collections:	Articles (МТМСФТ)

Files in this item:

File	Description	Size	Format
paper21.pdf		1.34 MB	Adobe PDF


All items in the electronic resources archive are protected by copyright, all rights reserved.