Robustness and Stability Analyses of Ensemble Learning Models in Android Malware Detection

Robustness and Stability Analyses of Ensemble Learning Models in Android Malware Detection. Masters thesis, King Fahd University of Petroleum and Minerals.

PDF (M.Sc Thesis Report)
Lama_Thesis_Finallll.pdf - Submitted Version

Arabic Abstract (English translation)

Android malware has grown at an alarming rate in recent years. Traditional signature-based antivirus solutions cannot keep pace with the increasing sophistication of the malware discovered every day. As a result, new research has been directed toward combining different analysis methods, especially static analysis, with machine learning techniques. As malware evolves, some challenges in Android malware detection have become more serious. Although the publicly available In-The-Lab (ITL) benchmark datasets are old and no longer reflect current malware, they are still used to evaluate Android malware detection. In addition, the scarcity of Android malware relative to benign applications causes an imbalance problem that affects the development of machine learning-based detection models. To address the class imbalance problem, various ensemble learning methods have been proposed to improve detection performance in such scenarios. Accordingly, the primary aim of this thesis is to examine static analysis and ensemble learning for detecting Android malware in a robust and stable manner. To highlight the differences between ITL and In-The-Wild (ITW) malware, we examined the obfuscation capabilities of Android applications obtained from both sources. The results indicated that ITW contains more advanced and more obfuscated malware. To increase the resilience of static analysis against malware obfuscation, a diverse set of features was extracted using two static analysis methods: Android characteristics analysis and opcodes were used to build the ITL and ITW datasets. After the datasets were built, they were sampled to obtain different class-imbalanced and class-balanced distributions for evaluating the performance of a number of homogeneous and heterogeneous ensemble learning models in imbalanced detection.
While class imbalance affected all models, the results revealed that ensemble learning models, especially homogeneous tree-based ones, are able to detect the majority of Android malware even when the datasets are imbalanced. Moreover, the evaluation results of the models on the ITL and ITW datasets indicated that the ITW dataset poses a greater challenge to the models. This conclusion supports our earlier obfuscation analysis. Consequently, we recommend that researchers use the ITW dataset instead of the outdated ITL dataset. In addition, a comprehensive analysis of the robustness and stability of the models was carried out using the Wilcoxon signed-rank test and clustering. The results of this analysis showed that homogeneous ensembles ranked best in robustness and stability, followed by stacked heterogeneous ensembles, while conventional models ranked last.

English Abstract

Android malware has grown at an alarming rate in recent years. Traditional signature-based anti-virus solutions are unable to keep up with the increasingly advanced malware discovered every day. As a result, additional research has been dedicated to combining various analysis approaches, especially static analysis, with machine learning techniques. As malware evolves, certain challenges in Android malware detection have become more significant. Although the publicly available In-The-Lab (ITL) benchmark datasets are outdated and no longer reflect current malware, they are still used to evaluate Android malware detection. Additionally, the scarcity of Android malware relative to benign applications causes an imbalance problem that affects the development of machine learning-based detection models. To address the issue of class imbalance, various ensemble learning methods have been proposed to enhance detection performance in such scenarios. Consequently, the primary objective of this thesis is to examine static analysis and ensemble learning for detecting Android malware in a robust and stable manner. To highlight the differences between ITL and In-The-Wild (ITW) malware, this study investigated the obfuscation capabilities of Android applications obtained from both sources. The results indicated that ITW contains more advanced and obfuscated malware. To increase the resilience of static analysis to malware obfuscation, a diverse set of features, extracted using two static analysis methods (Android characteristics analysis and opcode analysis), was used to construct the ITL and ITW datasets. After the datasets were created, they were sampled into different imbalanced and balanced class distributions to evaluate the performance of several homogeneous and heterogeneous ensembles in imbalanced detection.
While class imbalance impacted all models, the results revealed that ensemble models, particularly homogeneous tree-based ensembles, are capable of detecting the majority of Android malware even when the datasets are imbalanced. Furthermore, evaluation results for the models on the ITL and ITW datasets indicated that the ITW dataset presented a greater challenge to the models. This conclusion corroborates the earlier analysis of the obfuscation level in the datasets. As a result, researchers are encouraged to use the ITW dataset rather than the outdated ITL dataset. Additionally, a comprehensive analysis of the robustness and stability of the models was conducted using the Wilcoxon signed-rank test and clustering. The results of this analysis showed that homogeneous ensembles ranked best in terms of robustness and stability, followed by stacked heterogeneous ensembles, while conventional models ranked last.
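
The abstract names the Wilcoxon signed-rank test but does not say how it was applied in the thesis. As a rough illustration only, the sketch below computes the signed-rank sums for paired scores of two models, assuming one score per evaluation run for each model. The function name and the example scores are hypothetical and are not taken from the thesis.

```python
def wilcoxon_signed_rank(scores_a, scores_b):
    """Return (w_plus, w_minus) for paired scores.

    Assumption: scores_a[i] and scores_b[i] come from the same evaluation
    run (for example, the same fold). Zero differences are dropped.
    """
    # Paired differences, discarding pairs where the two models tie exactly.
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions i..j share this 1-based rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Sum the ranks separately for positive and negative differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical per-run scores (in percent) for two models.
model_a = [90, 85, 88, 92, 80]
model_b = [88, 86, 84, 87, 80]
w_plus, w_minus = wilcoxon_signed_rank(model_a, model_b)
print(w_plus, w_minus)
```

The smaller of the two sums is the usual test statistic; a value far below the other sum suggests that one model scores consistently higher. In practice a library routine such as `scipy.stats.wilcoxon` would also supply the p-value, which this sketch leaves out.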

Item Type:	Thesis (Masters)
Subjects:	Computer
Department:	College of Computing and Mathematics > Information and Computer Science
Committee Advisor:	Aljamaan, Hamoud
Committee Members:	Mahmoud, Sajjad and Rahmani, Md Mahfuzur
Depositing User:	LAMA AL BAKHAT (g201901830)
Date Deposited:	05 Jun 2022 06:23
Last Modified:	10 Apr 2025 06:27
URI:	http://eprints.kfupm.edu.sa/id/eprint/142146