Performance of Methods in Identifying Similar Languages Based on String to Word Vector

Bibliographic Details
Main Author:	Sujaini, Herry
Format:	UMS Journal (OJS)
Language:	eng
Published:	Department of Informatics, Universitas Muhammadiyah Surakarta, Indonesia 2020
Subjects:	identification of languages; local languages; string to word vector
Online Access:	https://journals.ums.ac.id/index.php/khif/article/view/8199

author	Sujaini, Herry
collection	OJS
description	Indonesia has a large number of local languages that share cognate words, and some of these languages closely resemble one another. Automatic identification within a family of similar languages is difficult, so it is necessary to determine which language identification method performs best at the task. This study attempts to identify Indonesian local languages using the String to Word Vector approach. A string vector refers to a collection of ordered words. In a string vector, a word is represented as an element or value, while in each numeric vector the word becomes an attribute or feature. Among the Naïve Bayes, SMO, J48, and ZeroR classifiers, SMO is the most accurate, with an accuracy of 95.7% for 10-fold cross-validation and 94.4% for a 60%:40% split. The best tokenizer for this classification is the Character N-Gram tokenizer. All classifiers except ZeroR show increased accuracy when using the Character N-Gram tokenizer compared to the Word tokenizer. The best features of this system are character trigrams and four-grams. The trigram is preferred because it requires less training data. The highest accuracy in the combination experiment is 0.965, obtained with IDF = FALSE and WC = TRUE, regardless of the TF setting.
format	UMS Journal (OJS)
id	oai:ojs2.journals.ums.ac.id:article-8199
institution	Universitas Muhammadiyah Surakarta
language	eng
publishDate	2020
publisher	Department of Informatics, Universitas Muhammadiyah Surakarta, Indonesia
record_format	ojs
spelling	oai:ojs2.journals.ums.ac.id:article-8199; Department of Informatics, Universitas Muhammadiyah Surakarta, Indonesia; 2020-04-22; info:eu-repo/semantics/article; info:eu-repo/semantics/publishedVersion; application/pdf; https://journals.ums.ac.id/index.php/khif/article/view/8199; DOI 10.23917/khif.v6i1.8199; Khazanah Informatika: Jurnal Ilmu Komputer dan Informatika; Vol. 6 No. 1 April 2020; 9-14; ISSN 2477-698X, 2621-038X; eng; https://journals.ums.ac.id/index.php/khif/article/view/8199/5506; Copyright (c) 2020 Khazanah Informatika: Jurnal Ilmu Komputer dan Informatika; http://creativecommons.org/licenses/by/4.0
title	Performance of Methods in Identifying Similar Languages Based on String to Word Vector
topic	identification of languages; local languages; string to word vector
url	https://journals.ums.ac.id/index.php/khif/article/view/8199
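
The abstract in the description field mentions character n-gram tokenization and the TF, IDF, and WC (word count) options of a string-to-word-vector conversion. The following is a minimal sketch of that idea in plain Python, not the study's actual implementation: the function names, the sample strings, and the exact transform formulas (log(1 + f) for TF and f * log(N / df) for IDF) are assumptions made for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def vectorize(docs, n=3, word_counts=True, tf=False, idf=False):
    """Map each document to a numeric vector over a shared n-gram vocabulary.

    word_counts=False records only presence (1.0) or absence (0.0).
    tf and idf apply assumed textbook transforms; they are illustrative only.
    """
    tokenized = [Counter(char_ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*tokenized))
    n_docs = len(docs)
    # Document frequency: number of documents that contain each n-gram.
    df = {g: sum(1 for c in tokenized if g in c) for g in vocab}
    vectors = []
    for counts in tokenized:
        row = []
        for g in vocab:
            value = float(counts[g]) if word_counts else float(g in counts)
            if tf:
                value = math.log(1.0 + value)
            if idf:
                value *= math.log(n_docs / df[g])
            row.append(value)
        vectors.append(row)
    return vocab, vectors


if __name__ == "__main__":
    # Hypothetical sample strings, chosen only to demonstrate the output shape.
    vocab, vectors = vectorize(["makan nasi", "mangan sego"], n=3)
    print(vocab)
    print(vectors)
```

Each row produced by `vectorize` could then be passed to any classifier. Setting `idf=False` and `word_counts=True` corresponds in spirit to the best-performing combination reported in the abstract, although the sketch makes no claim to reproduce those accuracy figures.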
