SOFTWARE REQUIREMENTS CLASSIFICATION USING DEEP LEARNING DEALING WITH DATA SCARCITY

SOFTWARE REQUIREMENTS CLASSIFICATION USING DEEP LEARNING DEALING WITH DATA SCARCITY. PhD thesis, King Fahd University of Petroleum and Minerals.

Arabic Abstract (English translation)

Manually classifying natural language-based requirements into categories is complex, error-prone, and time-consuming. The requirements classification process has been automated using traditional machine learning (ML) and deep learning (DL) algorithms. Traditional ML-based algorithms have achieved promising results compared with rule-based methods. However, they have several shortcomings, such as the need for extensive feature engineering, which has motivated researchers to look for alternative solutions. DL-based models have surpassed the state of the art in almost all important natural language processing (NLP) tasks. Existing requirements classification approaches mainly use lexical and syntactic features to classify requirements using both traditional machine learning and deep learning methods, with promising results. However, the existing techniques depend on word and sentence structures and use preprocessing and feature engineering techniques to classify requirements from textual natural language documents. Moreover, existing studies treat requirements classification as a binary or multiclass classification problem and not as multilabel classification, although a requirement can belong to multiple classes at the same time. Because of the labeling problem, there is a large gap in model accuracy between low-resource languages such as Arabic and resource-rich languages. To the best of our knowledge, existing approaches have considered software requirements written mostly in English and in other languages that use the Roman script. All available datasets are imbalanced, with a large difference between the numbers of samples per class. There is a lack of data augmentation techniques that can improve results and reduce the bias of imbalanced datasets in the software requirements domain. In this thesis, we classify requirements into functional and various nonfunctional types with minimal preprocessing and model the task as a multilabel classification problem. In addition, requirements classification from either the native-language document or the software document translated using machine translation (AMT) has been investigated, and its effect has been examined on three different datasets written originally in either Arabic or English.
Finally, we propose dictionary-based and transformer-based data augmentation techniques to address the problems of imbalanced or limited dataset size in the software requirements domain. Experiments conducted on the publicly available PROMISE and EHR datasets show the effectiveness of the presented techniques. We achieve state-of-the-art results on most tasks using word sequences as tokens. Requirements can be effectively classified into functional and various nonfunctional categories using the presented recurrent neural network-based deep learning system, which involves minimal text preprocessing and no feature engineering.

English Abstract

Natural language documents are the primary source of requirements in a software project. Requirements classification is an essential step toward automatically analyzing natural language-based requirement artifacts. Requirements are typically classified into two main categories, namely Functional Requirements (FRs) and Nonfunctional Requirements (NFRs). Nonfunctional requirements are further classified into a range of quality requirements, such as 'security', 'availability', and 'usability'. Software success depends on adherence to both functional and nonfunctional requirements. Hence, requirements engineers need to identify and categorize both functional and nonfunctional requirements from a range of natural language-based artifacts. Existing requirements classification approaches mainly use lexical and syntactic features to classify requirements using both traditional machine learning and deep learning approaches, with promising results. However, the existing techniques depend on word and sentence structures and employ preprocessing and feature engineering techniques to classify requirements from textual natural language documents. Moreover, existing studies treat requirements classification as a binary or multiclass classification problem and not as multilabel classification, although a given requirement can belong to multiple classes at the same time. Additionally, data scarcity and data imbalance are two main challenges in software requirements classification using deep learning techniques. As a result, the classifier's performance is affected and biased against classes with few samples. The issue of data scarcity is further exacerbated for low-resource languages such as Arabic. Previous studies investigated software requirements classification for requirements written mostly in English and in other languages that use the Roman script.
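The difference between multiclass and multilabel classification mentioned above comes down to how the target is encoded. The following is a minimal sketch in plain Python, not the thesis's implementation; the label set `LABELS` and the helper `to_multi_hot` are hypothetical names chosen for illustration, and the actual categories used in the thesis may differ.

```python
# Hypothetical label set for illustration; the thesis's categories may differ.
LABELS = ["functional", "security", "availability", "usability"]


def to_multi_hot(labels, label_set=LABELS):
    """Encode a requirement's labels as a multi-hot vector, one slot per class."""
    unknown = set(labels) - set(label_set)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in labels else 0 for name in label_set]


# Multiclass view: exactly one label per requirement, so one slot is set.
single = to_multi_hot(["usability"])

# Multilabel view: a requirement may carry several labels at once.
multi = to_multi_hot(["functional", "security"])
```

With a multi-hot target, a classifier typically makes an independent yes/no decision per class instead of picking a single winner, which is what allows one requirement to be both functional and security-related.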
In this dissertation, we classify requirements into functional and nonfunctional categories and investigate the multilabel classification problem. An end-to-end deep learning pipeline is presented to perform requirements classification with minimal preprocessing and no feature engineering. In addition, we investigate data augmentation techniques for dealing with limited and imbalanced training data. We present adaptive data augmentation techniques that use the class and the sample length to dynamically assign the percentage of change and the number of augmentations for single-label and multilabel classification. We also present dictionary-based and transformer-based data augmentation techniques to address issues of textual data augmentation in the software requirements domain.
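The idea of adapting augmentation to the class and the sample length can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method from the thesis: the abstract does not give the actual rules, so the formulas in `augmentations_per_sample` and `change_fraction`, the `SYNONYMS` dictionary, and all thresholds are hypothetical. The sketch covers the single-label case only.

```python
import math
import random
from collections import Counter

# Hypothetical domain dictionary; not taken from the thesis.
SYNONYMS = {
    "system": ["application", "software"],
    "user": ["operator", "customer"],
    "display": ["show", "present"],
    "shall": ["must"],
}


def augmentations_per_sample(label, class_counts):
    """Assumed rule: rarer classes receive more augmented copies per sample."""
    largest = max(class_counts.values())
    return math.ceil(largest / class_counts[label]) - 1


def change_fraction(num_tokens, low=0.1, high=0.3, long_at=30):
    """Assumed rule: short samples change a larger share of tokens than long ones."""
    scale = min(num_tokens, long_at) / long_at
    return high - (high - low) * scale


def augment(text, fraction, rng):
    """Replace up to `fraction` of the tokens with entries from the dictionary."""
    tokens = text.split()
    candidates = [i for i, t in enumerate(tokens) if t.lower() in SYNONYMS]
    budget = max(1, round(fraction * len(tokens)))
    for i in rng.sample(candidates, min(budget, len(candidates))):
        tokens[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    return " ".join(tokens)


def balance(samples, seed=0):
    """Return the original samples plus augmented copies for minority classes."""
    rng = random.Random(seed)
    counts = Counter(label for _, label in samples)
    out = list(samples)
    for text, label in samples:
        fraction = change_fraction(len(text.split()))
        for _ in range(augmentations_per_sample(label, counts)):
            out.append((augment(text, fraction, rng), label))
    return out
```

Calling `balance` on a list of `(text, label)` pairs leaves the largest class untouched and adds dictionary-substituted copies to the smaller ones. A transformer-based variant would replace the dictionary lookup in `augment` with model-suggested substitutions.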
Finally, the classification of requirements written in Arabic, a low-resource language, has been investigated. Several experiments have been conducted to highlight the effects of different language-specific NLP techniques on the performance of Arabic requirements classification. The impact of Machine Translation (MT) has been examined on datasets written originally in either Arabic or English, to provide an alternative approach for a low-resource language such as Arabic: the Arabic requirements are translated into English and then classified by the system trained on an English dataset. Experiments conducted on publicly available datasets show the effectiveness of the presented techniques. We achieve state-of-the-art results on most tasks using word sequences as tokens. Requirements can be effectively classified into functional and nonfunctional categories using the proposed recurrent neural network-based deep learning system, which involves minimal text preprocessing and no feature engineering. Further improvements in classification results are reported using our adaptive data augmentation techniques and our dictionary-based and transformer-based data augmentation techniques. Several experiments show that machine translation of Arabic requirements is a viable alternative to employing language-specific NLP processing and a way to deal with data scarcity.

Item Type:	Thesis (PhD)
Subjects:	Computer
Department:	College of Computing and Mathematics > Information and Computer Science
Committee Advisor:	MAHMOOD, SAJJAD
Committee Co-Advisor:	AHMAD, IRFAN
Committee Members:	NIAZI, MAHMOOD and ALSHAYEB, MOHAMMAD and AHMED, MOATAZ
Depositing User:	osama Mohammed (g201306350)
Date Deposited:	15 Nov 2022 10:18
Last Modified:	10 Apr 2025 06:22
URI:	http://eprints.kfupm.edu.sa/id/eprint/142236