A Recurrent Neural Network-Based Approach for Word-Level Diacritization of Arabic Text

A Recurrent Neural Network-Based Approach for Word-Level Diacritization of Arabic Text. Masters thesis, King Fahd University of Petroleum and Minerals.

2019_MS Thesis_Salih Taqiuddin.pdf - Published Version
Restricted to Repository staff only until 3 January 2021.

Arabic Abstract

إنَّ التَّشْكِيلَ الْآلِيَّ لِلنَّصِّ الْعَرَبِيِّ حَقْلٌ مُهِمٌّ وَنَاشِطٌ مِنْ حُقُولِ الْبَحْثِ الْعِلْمِيِّ، وَذُو مَجَالَاتِ تَطْبِيقٍ عَدِيدَةٍ مِنْ ضِمْنِهَا: أَنْظِمَةُ تَحْوِيلِ النُّصُوصِ إِلَى كَلَامٍ، وَأَدَوَاتُ مُعَاوَنَةِ مُتَعَلِّمِي اللُّغَةِ الْعَرَبِيَّةِ. وَعَلَى الرَّغْمِ مِنِ اجْتِذَابِ هَذَا الْحَقْلِ لِلْكَثِيرِ مِنَ الْأَبْحَاثِ خِلَالَ الْعُقُودِ الثَّلَاثَةِ الْمَاضِيَةِ، فَإنَّ الْحَاجَةَ إِلَى التَّطْوِيرِ وَالتَّحْسِينِ فِيهِ مَا تَزَالُ قَائِمَةً. فِي الْوَقْتِ الرَّاهِنِ، يَتَمَتَّعُ مَجَالُ التَّعَلُّمِ الْمُتَعَمِّقِ – اَلْمُنْدَرِجُ فِي حَقْلِ التَّعَلُّمِ الْآلِيِّ – بِنَجَاحٍ كَبِيرٍ وَانْتِشَارٍ فِي نِطَاقٍ وَاسِعٍ مِنَ الْمَجَالَاتِ التَّطْبِيقِيَّةِ. وَبِشَكْلٍ أَكْثَرَ تَحْدِيدًا، فَإنَّ بُنْيَةَ الشَّبَكَةِ الْعَصَبُونِيَّةِ الرَّاجِعَةِ تَسْتَعِيدُ اهْتِمَامًا كَبِيرًا لَدَى الْبَاحِثِينَ، وَتُحَقِّقُ أَفْضَلَ النَّتَائِجِ الْمَنْشُورَةِ فِي الْعَدِيدِ مِنْ حُقُولِ الْبَحْثِ الْعِلْمِيِّ. لَقَدْ سَبَقَ وَأَنْ أُجْرِيَتْ أَبْحَاثٌ عِدَّةٌ تَتَطَرَّقُ إِلَى مُشْكِلَةِ الْبَحْثِ بِاسْتِخْدَامِ نُهُجِ التَّعَلُّمِ الْمُتَعَمِّقِ. لَكِنَّ نُهُجَ مُعْظَمِ هَذِهِ الْأَبْحَاثِ مَبْنِيَّةٌ عَلَى مُعَالَجَةِ الْمُشْكِلَةِ وَحَلِّهَا بِالنَّظَرِ إِلَى مُسْتَوَى أَحْرُفِ النَّصِّ الْمُرَادِ تَشْكِيلُهُ. فَالْهَدَفُ مِنْ هَذِهِ الرِّسَالَةِ هُوَ اسْتِكْشَافُ ذَاتِ النَّهْجِ (أَيِ الْقَائِمِ عَلَى التَّعَلُّمِ الْمُتَعَمِّقِ)، غَيْرَ أَنَّهُ مَبْنِيٌّ عَلَى مُعَالَجَةِ مُشْكِلَةِ الْبَحْثِ بِالنَّظَرِ إِلَى مُسْتَوَى الْكَلِمَاتِ. وَتَتَلَخَّصُ إسْهَامَاتُ الرِّسَالَةِ فِي (1) تَطْوِيرِ تَعْرِيفٍ مَنْهَجِيٍّ لِمُشْكِلَةِ الْبَحْثِ بِشَكْلٍ مُسْتَقِلٍّ عَنْ نُهُجِ الْحُلُولِ الْمُتَّبَعَةِ لَهَا، (2) وَتَأْلِيفِ إِطَارٍ مُحْكَمٍ لِمُشْكِلَةِ الْبَحْثِ، (3) وَاسْتِحْدَاثِ مِقْيَاسٍ جَدِيدٍ لِتَقْيِيمِ أَدَاءِ نِظَامِ أَوْ نَهْجِ تَشْكِيلٍ، (4) وَالِاكْتِشَافِ وَالْمُعَالَجَةِ التِّلْقَائِيَّيْنِ لِحَالَاتٍ عِدَّةٍ مِنْ إعْرَابِ اللُّغَةِ الْعَرَبِيَّةِ. 
وَقَدْ تَمَّ فِي هَذَا الْعَمَلِ تَحْقِيقُ النَّتَائِجِ التَّالِيَةِ: 16.70% خَطَأً عَلَى مُسْتَوَى الشَّكْلَةِ، 16.38% خَطَأً عَلَى مُسْتَوَى الْحَرْفِ، و 38.53% خَطَأً عَلَى مُسْتَوَى الْكَلِمَةِ. يَعْنِي ذَلِكَ أَنَّهُ تَمَّ تَشْكِيلُ حَوَالَيْ 84% مِنْ أَحْرُفِ النَّصِّ شَكْلًا صَحِيحًا.

English Abstract

Automatic Arabic text diacritization (AATD) is an active and important field of research. It has numerous applications, including text-to-speech systems and aids for learners of Arabic. Although it has attracted considerable research over the last three decades, there is still a need for improvement. Currently, deep learning, a subfield of machine learning, is enjoying great success across a wide range of applications. More specifically, the recurrent neural network architecture is regaining strong interest and achieving state-of-the-art published results in various research fields. Several publications have already applied deep learning approaches to the AATD problem. However, most of these works built their approaches on a character-level basis. This thesis aims to explore a deep learning approach to the problem that is developed on a pure word-level basis. Its contributions are (1) developing a formal definition of the research problem that is independent of solution approaches, (2) assembling a coherent framework for the research problem, (3) introducing a new metric for evaluating the performance of a diacritization system or approach, and (4) automatically detecting and processing multiple cases of Arabic declension. The results achieved in this work are a 16.70% diacritic error rate, a 16.38% character error rate, and a 38.53% word error rate. This means that around 84% of the text's characters are correctly diacritized (100% − 16.38% = 83.62%).
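
To make the character-level and word-level error rates quoted above more concrete, the following is a minimal Python sketch of how such rates are often computed. It is not the thesis's evaluation code, and the thesis's exact metric definitions (including its diacritic error rate and its new metric) are not given in this record. The sketch assumes that a character counts as an error when its set of diacritics differs from the reference, that a word counts as an error when it contains at least one such character, and that the diacritics are the Unicode marks U+064B to U+0652. The names `split_diacritics` and `error_rates` are hypothetical.

```python
# Assumption: the diacritics of interest are the Arabic marks
# U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653))


def split_diacritics(word):
    """Return a list of (letter, diacritics) pairs for one word."""
    pairs = []
    for ch in word:
        if ch not in DIACRITICS:
            pairs.append((ch, ""))
        elif pairs:
            # Attach the mark to the most recent letter.
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)
    return pairs


def error_rates(reference, hypothesis):
    """Return (character error rate, word error rate) as fractions.

    Assumed definitions (not taken from the thesis): a character is wrong
    if its diacritics differ from the reference, ignoring mark order;
    a word is wrong if any of its characters is wrong.
    """
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("texts must be non-empty with equal word counts")

    char_total = char_errors = word_errors = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_pairs = split_diacritics(ref_word)
        hyp_pairs = split_diacritics(hyp_word)
        if [l for l, _ in ref_pairs] != [l for l, _ in hyp_pairs]:
            raise ValueError("words must share the same letters")
        wrong = sum(
            sorted(r) != sorted(h)
            for (_, r), (_, h) in zip(ref_pairs, hyp_pairs)
        )
        char_total += len(ref_pairs)
        char_errors += wrong
        word_errors += wrong > 0
    return char_errors / char_total, word_errors / len(ref_words)


# Two words, five letters in total; the hypothesis puts a damma instead
# of a fatha on the first letter of the first word.
ref = "\u0643\u064E\u062A\u064E\u0628\u064E \u0645\u0650\u0646\u0652"
hyp = "\u0643\u064F\u062A\u064E\u0628\u064E \u0645\u0650\u0646\u0652"
print(error_rates(ref, hyp))  # (0.2, 0.5)
```

Under these assumed definitions, one wrong diacritic in a word of three letters costs one character error but a full word error, which is consistent with the word error rate reported in the abstract being much higher than the character error rate.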

Item Type: Thesis (Masters)
Subjects: Computer
Divisions: College Of Computer Sciences and Engineering > Information and Computer Science Dept
Committee Advisor: Al-Khatib, Wasfi
Committee Members: Ghouti, Lahouari and Husni, Al-Muhtaseb
Depositing User: SALIH TAQIUDDIN (g201206380)
Date Deposited: 22 Jan 2020 15:45
Last Modified: 22 Jan 2020 15:45
URI: http://eprints.kfupm.edu.sa/id/eprint/141415