Advancing Writer Identification in Historical Arabic Manuscripts: Building a Robust Dataset and Optimizing AI Models

Saleh, Abdel

Saleh, Abdel (2026) Advancing Writer Identification in Historical Arabic Manuscripts: Building a Robust Dataset and Optimizing AI Models. Masters thesis, King Fahd University of Petroleum and Minerals.

Arabic Abstract (English translation)

Name: Abdel Rahman Saleh
Thesis Title: Advancing Writer Identification in Historical Arabic Manuscripts: Building a Verified Dataset and Training Artificial Intelligence Models
Major Field: Computer Science
Date of Degree: June 2026

Writer identification in historical Arabic manuscripts is an important task for manuscript authentication, cataloging, attribution of scattered fragments, and preservation of cultural heritage. However, progress in this area remains limited because of the scarcity of data prepared for this purpose, the difficulty of verifying the true attribution of handwriting, and the risk of distorted evaluation results when pages from the same manuscript are distributed across the training and test sets. This thesis addresses these challenges by introducing WrIHAM, a curated dataset dedicated to writer identification in historical Arabic manuscripts, and by developing a deep learning pipeline and evaluating it under realistic manuscript-level conditions. The final version of WrIHAM consists of 62 handwritten books written by 18 known writers, totaling 1,144 page images. Each writer is represented by at least three different books, which allows train, validation, and test splits that are separated at the book level. The dataset was collected and curated with support from experts in handwriting attribution. This design encourages models to learn handwriting characteristics tied to the writer rather than relying on manuscript-specific characteristics such as paper texture, layout, scanning conditions, or signs of damage and degradation. For writer identification, this thesis develops an end-to-end pipeline based on extracting text lines from pages. Manuscript pages are first processed using a line extraction model trained on the MIRATH dataset. The extracted lines are then classified using deep learning models, and final page-level predictions are derived by aggregating line-level prediction probabilities. An ablation approach was used to identify the factors that most helped the model reach its best result: it tested a variety of color-processing options, as well as training the model on whole lines versus on segments extracted from each line to approximate word-level input.
The selected EfficientNet-B4 configuration was then optimized using hyperparameter search and evaluated once on the held-out test set under the book-level separation protocol. The final optimized model achieved 58.75% Macro Top-1 accuracy, 68.29% Macro Top-2 accuracy, and 81.48% Macro Top-3 accuracy in the book-disjoint test setting. In contrast, the shared-book protocol produced near-ceiling performance, with Macro Top-1 accuracy of 97.17% using EfficientNet-B4 and 96.25% using DeiT-Base. This large gap shows that the evaluation protocol substantially affects measured performance and confirms that book-disjoint evaluation is necessary for a realistic writer-identification benchmark. The results show that WrIHAM provides a challenging and useful benchmark, while also confirming that writer identification in historical Arabic manuscripts remains an open and difficult problem.
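
The book-level separation described in the abstract can be illustrated with a minimal sketch. It is not the thesis's actual splitting procedure: the function name `book_disjoint_split`, the writer and book identifiers, and the rule of one validation book and one test book per writer are all assumptions made for illustration. The only property taken from the abstract is that every writer has at least three books and that no book appears in more than one split.

```python
import random


def book_disjoint_split(books_by_writer, seed=0):
    """Assign whole books to train/val/test so no book spans two splits.

    books_by_writer maps a writer id to a list of book ids.
    Assumed rule (not from the thesis): each writer contributes one test
    book, one validation book, and all remaining books go to training.
    """
    rng = random.Random(seed)
    split = {"train": [], "val": [], "test": []}
    for writer, books in sorted(books_by_writer.items()):
        if len(books) < 3:
            raise ValueError(f"writer {writer!r} needs at least three books")
        shuffled = list(books)
        rng.shuffle(shuffled)
        split["test"].append((writer, shuffled[0]))
        split["val"].append((writer, shuffled[1]))
        split["train"].extend((writer, b) for b in shuffled[2:])
    return split


# Hypothetical toy catalogue: two writers, three and four books.
toy_books = {"w1": ["a", "b", "c"], "w2": ["d", "e", "f", "g"]}
toy_split = book_disjoint_split(toy_books, seed=0)
```

Because pages are assigned by book rather than individually, a classifier cannot score well on the test split by memorizing the paper texture or scan conditions of a book it saw during training.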

English Abstract

Name: Abdel Rahman Saleh
Title of Study: Advancing Writer Identification in Historical Arabic Manuscripts: Building a Robust Dataset and Optimizing AI Models
Major Field: Computer Science
Date of Degree: June 2026

Writer identification in historical Arabic manuscripts is an important task for manuscript authentication, cataloging, fragment attribution, and cultural heritage preservation. However, progress in this area remains limited due to the scarcity of reliable datasets, the difficulty of confirming true handwriting attribution, and the risk of inflated evaluation results when pages from the same manuscript are shared across training and testing splits. This thesis addresses these challenges by introducing WrIHAM, a curated dataset for writer identification in historical Arabic manuscripts, and by developing a deep learning pipeline evaluated under realistic manuscript-level conditions. The final version of WrIHAM contains 62 handwritten books by 18 known writers, totaling 1,144 page images. Each writer is represented by at least three distinct books, enabling book-disjoint train, validation, and test splits. The dataset was curated with expert-supported attribution. This design encourages models to learn writer-specific handwriting characteristics rather than manuscript-specific artifacts such as paper texture, layout, scanning conditions, or degradation patterns. For writer identification, this thesis develops a line-based deep learning pipeline. Manuscript pages are first processed using a MIRATH-trained line extraction model. The extracted text lines are then classified using deep neural networks, and page-level predictions are obtained by aggregating line-level probabilities. A staged Book-Disjoint model-selection ladder was used to evaluate input representations, binarization strategies, model backbones, and aggregation methods.
The selected EfficientNet-B4 configuration was further optimized using hyperparameter search and evaluated once on the held-out Book-Disjoint test split. The final optimized model achieved 58.75% macro Top-1 accuracy, 68.29% macro Top-2 accuracy, and 81.48% macro Top-3 accuracy under the Book-Disjoint test setting. In contrast, the Shared-Book evaluation protocol produced near-ceiling performance, with 97.17% macro Top-1 accuracy for EfficientNet-B4 and 96.25% for DeiT-Base. This large gap demonstrates that evaluation protocol strongly affects measured performance and confirms that book-disjoint evaluation is essential for realistic writer-identification benchmarking. The results show that WrIHAM provides a challenging and informative benchmark, while also highlighting that writer identification in historical Arabic manuscripts remains an open and difficult problem.
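
As a rough illustration of how line-level probabilities can become page-level predictions and macro Top-k scores, the sketch below averages the per-line class probabilities of each page and then averages per-writer Top-k accuracy over writers. Both choices are assumptions: the abstract states that several aggregation methods were compared without naming the selected one, and it does not define "macro" beyond the term itself. The function names and toy numbers are hypothetical and are not results from the thesis.

```python
from collections import defaultdict


def aggregate_page_probs(line_probs):
    """Average the per-line class probabilities of one page (assumed rule)."""
    n_lines = len(line_probs)
    n_classes = len(line_probs[0])
    return [sum(p[c] for p in line_probs) / n_lines for c in range(n_classes)]


def top_k(probs, k):
    """Return the indices of the k highest-probability classes."""
    return sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]


def macro_top_k_accuracy(page_probs, labels, k):
    """Mean over writers of each writer's own Top-k page accuracy."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for probs, label in zip(page_probs, labels):
        totals[label] += 1
        hits[label] += int(label in top_k(probs, k))
    return sum(hits[w] / totals[w] for w in totals) / len(totals)


# Toy data: three pages, three writers, two line-level probability vectors each.
pages = [
    [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]],  # true writer 0
    [[0.4, 0.5, 0.1], [0.6, 0.3, 0.1]],  # true writer 1
    [[0.1, 0.3, 0.6], [0.2, 0.2, 0.6]],  # true writer 2
]
labels = [0, 1, 2]
page_probs = [aggregate_page_probs(p) for p in pages]
top1 = macro_top_k_accuracy(page_probs, labels, 1)
top2 = macro_top_k_accuracy(page_probs, labels, 2)
```

In this toy case the second page is misclassified at Top-1 but recovered at Top-2, which is the kind of gap the reported Top-1, Top-2, and Top-3 figures capture. Macro averaging gives every writer equal weight regardless of how many test pages that writer has.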

Item Type:	Thesis (Masters)
Subjects:	Computer Research
Department:	College of Computing and Mathematics > Information and Computer Science
Thesis Advisor:	Wasfi Al-khatib
Thesis Committee Members:	Irfan Ahmad, Moayad Alnammi
Depositing User:	ABDEL SALEH
Date Deposited:	16 Jun 2026 10:21
Last Modified:	01 Jul 2026 05:05
URI:	https://eprints.kfupm.edu.sa/id/eprint/144512