Automatic Language Identification for Indonesian-Malaysian Language Using Machine Learning

Bibliographic Details
Main Authors: Abdiansah, Abdiansah; Rizqie, Muhammad Qurhanul
Format: UMS Journal (OJS)
Language: eng
Published: Department of Informatics, Universitas Muhammadiyah Surakarta, Indonesia 2023
Subjects: Language Identification; Indonesian; Malaysian; Support Vector Machine
Online Access: https://journals.ums.ac.id/index.php/khif/article/view/21669
description Language Identification (LID) aims to determine which language a piece of text or speech is in. The task tends to be easier for languages with distinct characteristics (e.g., Indonesian and English) than for languages with similar characteristics (e.g., Indonesian and Malaysian). Similar languages introduce ambiguity that can bias a machine learning model. This study uses a Support Vector Machine (SVM) to distinguish Indonesian from Malaysian. The training and testing data are taken from the Leipzig Corpora Collection and a Twitter dataset. Features are represented with TF-IDF, and Multinomial Naive Bayes serves as the baseline. Two evaluation schemes are used: a split (20:80) and 10-fold cross-validation. The experimental results show that the baseline and the SVM achieve similar accuracy, both at around 90% or above. The results indicate that Indonesian and Malaysian can be identified with relatively high accuracy even with simple techniques.
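The baseline described in the abstract (TF-IDF features fed to Multinomial Naive Bayes) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it is written in plain Python, it covers only the baseline and not the SVM, and the word-level tokenization, the smoothed IDF formula, the Laplace smoothing value, and the six toy sentences are all assumptions made for the example. None of the data comes from the Leipzig Corpora Collection or the Twitter dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokens; the abstract does not say
    # which token unit the TF-IDF features were built on.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf(text, idf):
    # Term frequency times IDF; tokens unseen during fitting are dropped.
    tf = Counter(tokenize(text))
    return {tok: cnt * idf[tok] for tok, cnt in tf.items() if tok in idf}


def train_nb(docs, labels, idf, alpha=1.0):
    # Multinomial Naive Bayes over TF-IDF weights with Laplace smoothing.
    class_docs = Counter(labels)
    weight = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for tok, w in tfidf(doc, idf).items():
            weight[label][tok] += w
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(weight[label].values())
        denom = total + alpha * len(idf)
        log_like = {
            tok: math.log((weight[label][tok] + alpha) / denom) for tok in idf
        }
        model[label] = (math.log(n_docs / len(docs)), log_like)
    return model


def predict(text, idf, model):
    # Pick the class with the highest log prior plus weighted log likelihood.
    feats = tfidf(text, idf)
    scores = {
        label: log_prior + sum(w * log_like[tok] for tok, w in feats.items())
        for label, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy sentences (illustrative only): "id" = Indonesian, "ms" = Malaysian.
docs = [
    "saya pergi ke kantor naik mobil",
    "dia bisa datang karena libur",
    "kami naik sepeda ke rumah sakit",
    "saya pergi ke pejabat naik kereta",
    "dia boleh datang kerana cuti",
    "kami naik basikal ke hospital",
]
labels = ["id", "id", "id", "ms", "ms", "ms"]

idf = fit_idf(docs)
model = train_nb(docs, labels, idf)

print(predict("mobil saya di kantor", idf, model))
print(predict("kereta saya di pejabat", idf, model))
```

On data this small the classifier is driven almost entirely by the few words that differ between the two varieties (for example "kantor" versus "pejabat"), which is also why closely related languages are harder to separate than unrelated ones: most of the vocabulary is shared and carries little signal.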
id oai:ojs2.journals.ums.ac.id:article-21669
date 2023-10-29
type info:eu-repo/semantics/article; info:eu-repo/semantics/publishedVersion
file_format application/pdf
doi 10.23917/khif.v9i2.21669
source Khazanah Informatika : Jurnal Ilmu Komputer dan Informatika; Vol. 9 No. 2 October 2023; 104-110
issn 2477-698X; 2621-038X
fulltext https://journals.ums.ac.id/index.php/khif/article/view/21669/8758
supplementary https://journals.ums.ac.id/index.php/khif/article/downloadSuppFile/21669/5515
supplementary https://journals.ums.ac.id/index.php/khif/article/downloadSuppFile/21669/5516
rights Copyright (c) 2023 Abdiansah Abdiansah; https://creativecommons.org/licenses/by/4.0
topic Language Identification; Indonesian; Malaysian; Support Vector Machine