A Hybrid Framework for Detecting and Mitigating Jailbreak Attacks in Large Language Models

Alhetelah, Bushra

Alhetelah, Bushra (2026) A Hybrid Framework for Detecting and Mitigating Jailbreak Attacks in Large Language Models. Masters thesis, King Fahd University of Petroleum and Minerals.

PDF
A HYBRID FRAMEWORK FOR MITIGATING JAILBREAK ATTACKS IN LARGE LANGUAGE MODELS - Bushra Alhetelah .pdf
Restricted to Repository staff only until 18 August 2026.

Arabic Abstract (English translation)

Large Language Models (LLMs) can perform advanced natural language processing tasks such as question answering, summarization, and translation. However, they remain vulnerable to malicious prompting attacks, including so-called “jailbreak attacks,” in which prompts are crafted to bypass safety mechanisms and produce prohibited outputs. Current defense methods are usually classified into three main categories: pre-processing, in-processing, and post-processing methods. Although each category provides a partial level of protection, none offers a comprehensive solution against the evolving and adaptive nature of malicious prompts. To address these limitations, this study proposes a hybrid framework that combines pre-processing and post-processing strategies and is applicable to both open-source and closed-source models. The pre-processing stage relies on intention analysis to interpret the goal underlying the user's input before inference, enabling early detection of harmful or policy-violating inputs. The post-processing stage, in turn, uses a guard model to evaluate generated responses against predefined safety criteria, allowing unsafe responses to be rejected or rewritten while preserving sound outputs. The proposed framework was evaluated using the JailBreakV-28K dataset across several large language models and under different defense settings. The results showed that all defense mechanisms help reduce the Attack Success Rate (ASR) compared with the baseline, with post-processing methods outperforming pre-processing methods when applied independently. The hybrid framework also achieved the lowest ASR across all models and input modalities, with a reduction of up to about 99%. These results confirm the effectiveness of multi-layer defense strategies in strengthening the robustness of large language models against evolving jailbreak attacks while maintaining efficient performance in applied settings.

English Abstract

Large Language Models (LLMs) can perform advanced natural language tasks such as question answering, summarization, and translation. However, they remain vulnerable to malicious prompting attacks, including “jailbreak attacks,” in which prompts are crafted to bypass safety controls and produce restricted responses. Existing defense methods are typically categorized into three classes: pre-processing, in-processing, and post-processing approaches. While each provides partial protection, none offers a comprehensive solution against the evolving and adaptive nature of malicious prompts. To address these limitations, this work proposes a hybrid framework that integrates pre-processing and post-processing defense strategies and is applicable to both open- and closed-source LLMs. The pre-processing layer applies intention analysis to interpret the underlying intent of user prompts before model inference, enabling early detection of potentially malicious or policy-violating prompts. The post-processing layer, in turn, employs a guard model to evaluate generated responses against predefined safety criteria, allowing unsafe responses to be rejected or rewritten while benign responses are preserved. The proposed framework is evaluated using the JailBreakV-28K dataset across multiple LLMs under different defense configurations. Results show that all defenses reduce the Attack Success Rate (ASR) compared to the baseline, with post-processing outperforming pre-processing when applied independently. The hybrid configuration consistently achieves the lowest ASR across the evaluated models under the present experimental setup, with reductions of up to approximately 99%. These findings highlight the effectiveness of multi-layer defense strategies in improving the robustness of LLMs against evolving jailbreak attacks while maintaining practical performance.
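
The two-layer flow described in the abstract can be pictured with a minimal Python sketch. This is not the thesis's implementation: the class `HybridDefense`, the callables `intent_is_malicious`, `generate`, and `response_is_unsafe`, the `REFUSAL` string, and the keyword-based toy stubs are all hypothetical stand-ins. The sketch only rejects unsafe responses and does not rewrite them. The ASR helper assumes the common definition (fraction of attack prompts that yield a harmful output), and `relative_reduction` is one possible reading of a "reduction" figure, since the abstract does not say whether its figure is relative or absolute.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

REFUSAL = "I can't help with that request."  # hypothetical refusal message


@dataclass
class HybridDefense:
    """Pre-processing intent check, then generation, then post-processing guard.

    All three callables are illustrative assumptions, not the thesis's components.
    """
    intent_is_malicious: Callable[[str], bool]       # stands in for intention analysis
    generate: Callable[[str], str]                   # stands in for the protected LLM
    response_is_unsafe: Callable[[str, str], bool]   # stands in for the guard model

    def respond(self, prompt: str) -> str:
        # Layer 1: stop before inference if the prompt's intent looks malicious.
        if self.intent_is_malicious(prompt):
            return REFUSAL
        response = self.generate(prompt)
        # Layer 2: screen the generated response before returning it.
        if self.response_is_unsafe(prompt, response):
            return REFUSAL
        return response


def attack_success_rate(outputs: Iterable[str],
                        is_harmful: Callable[[str], bool]) -> float:
    """Fraction of attack outputs judged harmful (assumed ASR definition)."""
    outputs = list(outputs)
    if not outputs:
        raise ValueError("need at least one output")
    return sum(1 for o in outputs if is_harmful(o)) / len(outputs)


def relative_reduction(baseline_asr: float, defended_asr: float) -> float:
    """Relative drop in ASR versus the undefended baseline."""
    if baseline_asr <= 0:
        raise ValueError("baseline ASR must be positive")
    return (baseline_asr - defended_asr) / baseline_asr


# Toy stubs so the sketch runs end to end; real systems would call models here.
def toy_intent(prompt: str) -> bool:
    return "ignore previous instructions" in prompt.lower()


def toy_llm(prompt: str) -> str:
    return "UNSAFE: details" if "bomb" in prompt.lower() else "Sure: " + prompt


def toy_guard(prompt: str, response: str) -> bool:
    return response.startswith("UNSAFE")


defense = HybridDefense(toy_intent, toy_llm, toy_guard)
```

In this toy setup a prompt that slips past the intent check can still be caught by the guard, which is the layering argument the abstract makes; the keyword stubs say nothing about how well real components would perform.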

Item Type:	Thesis (Masters)
Subjects:	Computer
Department:	College of Computing and Mathematics > Computer Engineering
Thesis Advisor:	Ahmad Almulhem
Thesis Committee Members:	Ashraf Mahmoud, Muhamad Felemban
Depositing User:	BUSHRA ALHETELAH
Date Deposited:	18 May 2026 10:45
Last Modified:	30 Jun 2026 09:21
URI:	https://eprints.kfupm.edu.sa/id/eprint/144330