An-Najah National University  

Faculty of Graduate Studies 

 
NEBRAS: A RAG-BASED QUESTION 

ANSWERING SYSTEM FOR ISLAMIC AND 

LEGAL GUIDANCE 
 

By 

Samer Nitham Al-Huwari 

 
Supervisor 

Dr. Hamed Abdelhaq 

 
This Thesis is Submitted in Partial Fulfillment of the Requirements for the Degree 

of Master of Artificial Intelligence, Faculty of Graduate Studies, An-Najah 

National University, Nablus - Palestine. 

2025 



 

 
This Thesis was Defended Successfully on 27/02/2025 and approved by 

 
Dr. Hamed Abdelhaq   

Supervisor  Signature 

   
Prof. Mohammed Awad   

External Examiner  Signature 

   
Dr. Ajmad Hawwash   

Internal Examiner  

 
Signature 



Dedication 

To my incredible family—my parents and siblings—whose unwavering love, patience, 

and encouragement have been a constant source of strength throughout this journey. Your 

faith in me has been my greatest motivation, and your words have guided me forward. 

To Dr. Hamed, my mentor and guide, who not only supervised this research but also 

opened doors to opportunities I never imagined. Your belief in my potential has shaped 

my academic journey profoundly. 

Lastly, to myself, for demonstrating unwavering determination and confronting this 

journey with courage. This achievement stands as a testament to resilience, perseverance, 

and an enduring belief in the promise of brighter days ahead. 

 

Acknowledgements 

I would like to express my heartfelt gratitude to all those who contributed to the success 

of this research. My sincere thanks go to the Palestinian Dar Al-Ifta’a for their invaluable 

assistance in evaluating the fatwas generated by the system, providing critical insights 

that greatly enriched this work. I am also deeply thankful to An-Najah National University 

for providing a sample of Islamic fatwas related to the local Palestinian community, which 

served as a resource for implementing the Islamic Fatwa answer generation. Without the 

support and collaboration of these esteemed institutions, this research would not have 

been possible. 

 

Declaration 

 
I, the undersigned, declare that I submitted the thesis entitled: 

 
NEBRAS: A RAG-BASED QUESTION ANSWERING SYSTEM FOR ISLAMIC 

AND LEGAL GUIDANCE 

 
I declare that the work provided in this thesis, unless otherwise referenced, is the 

researcher’s own work, and has not been submitted elsewhere for any other degree or 

qualification. 

 
Student's Name: Samer Nitham Al-Huwari 

 
Signature: 

 
Date: 27/02/2025 

 

List of Contents 

Dedication ....................................................................................................................... III 

Acknowledgements ......................................................................................................... IV 

Declaration ....................................................................................................................... V 

List of Contents ............................................................................................................... VI 

List of Tables ................................................................................................................ VIII 

List of Figures ................................................................................................................. IX 

List of Appendices ........................................................................................................... X 

Abstract ........................................................................................................................... XI 

Chapter One: Introduction and Theoretical Background .................................................. 1 

1.1 Theoretical Background .............................................................................................. 5 

1.1.1 Question Answering ................................................................................................. 5 

1.1.2 Large Language Models ..................................................................................... 6 

1.1.3 Prompt Engineering ................................................................................................. 7 

1.1.4 Retrieval Augmented Generation ............................................................................. 8 

1.1.5 Agents in Large Language Models ........................................................................ 14 

1.2 Literature Review ..................................................................................................... 14 

1.3 Problem Statement .................................................................................................... 18 

1.4 Aims of Study ........................................................................................................... 20 

1.5 Hypotheses of Study ................................................................................................. 21 

1.5.1 Accuracy Hypothesis ............................................................................................. 21 

1.5.2 Adaptability and Scalability Hypothesis ................................................................ 21 

1.5.3 Language-Specific Performance Hypothesis ......................................................... 22 

Chapter Two: Methods .................................................................................................... 23 

2.1 Data Collection ......................................................................................................... 23 

2.1.1 Islamic Fatwa Dataset Collection .......................................................................... 24 

2.1.2 University Help-desk Dataset Collection .............................................................. 24 

2.2 Data Pre-processing and Structuring ........................................................................ 26 



2.2.1 Islamic Fatwa Dataset Pre-processing and Structuring ......................................... 27 

2.2.2 NNU Dataset Pre-processing and Structuring ....................................................... 31 

2.3 Implementation ......................................................................................................... 32 

2.3.1 Vector Database ..................................................................................................... 32 

2.3.2 Indexing and Chunking .......................................................................................... 34 

2.3.3 QA Pipeline ............................................................................................................ 35 

Chapter Three: Experimentation and Results ............................................................ 42 

3.1 Indexing and chunking .............................................................................................. 43 

3.2 Implementation ......................................................................................................... 47 

3.3 Experiment with Islamic Fatwa Dataset ................................................................... 47 

3.3.1 Baseline Evaluation: Responses Without Retrieval ............................................... 47 

3.3.2 Context Effectiveness on Fatwa Answer Generation............................................. 49 

3.3.3 Employing RAG .................................................................................................... 52 

3.4 Experiment with An-Najah National University Dataset .......................................... 60 

3.4.1 Baseline Evaluation: Responses Without Retrieval ............................................... 61 

3.4.2 Hybrid Retrieval Pipeline Evaluation with NNU Dataset ..................................... 62 

Chapter Four: Discussions and Conclusion .................................................................... 64 

List of Abbreviations ...................................................................................................... 66 

References ....................................................................................................................... 67 

Appendices ...................................................................................................................... 77 

 ب ................................................................................................... الملخص 

 

List of Tables 

Table 1: Islamweb Fatwa Fields Mapping ...................................................................... 24 

Table 2: NNU Academic Majors Fields Mapping .......................................................... 25 

Table 3: NNU Academic Courses Fields Mapping ......................................................... 26 

Table 4: Token Counts from the Scraped Islamweb Dataset .......................................... 27 

Table 5: Frequency of Start and Trailing Words in Introductory Sentences in Fatwa .... 29 

Table 6: Frequency of First Words After Removing Introductory Sentences in Fatwa .. 29 

Table 7: Token Counts from the Scraped Islamweb Dataset After Cleaning .................. 30 

Table 8: Embedding Models Evaluation ......................................................................... 46 

Table 9: Summary of Implementation Technologies ...................................................... 47 

Table 10: Palestinian Dar Al Ifta'a Evaluation of Generated Fatwas .............................. 50 

  

List of Figures 

Figure 1: Traditional RAG Pipeline .................................................................................. 9 

Figure 2: Two-Level Hierarchical Structure Indexing for One Document ..................... 10 

Figure 3: Query Expansion Using HyDE ....................................................................... 13 

Figure 4: Dataset Field Mapper ...................................................................................... 34 

Figure 5: QA Agentic RAG Pipeline .............................................................................. 36 

Figure 6: Hybrid Retrieval Pipeline ................................................................................ 41 

Figure 7: Ground Truth Evaluation With and Without Context ...................................... 51 

Figure 8: Comparing GT Automatic Metrics & Human Evaluation............................... 58 

Figure 9: Ground Truth Comparison Between Nebras and Baseline Models ................. 60 

Figure 10: NNU Dataset Ground Truth Scores Comparison .......................................... 63 

 

List of Appendices 

Appendix A: Google Gemini Response to Accounting Major Required GPA ............... 77 

Appendix B: GPT-4o Response to Accounting Major Required GPA ........................... 80 

Appendix C: LLM and Mufti Responses to Fatwa on Promoting Products ................... 81 

Appendix D: Dataset Scraped Fields Mapping ............................................................... 86 

Appendix E: Fatwa Dataset Category Distribution ........................................................ 88 

Appendix F: Field Augmentation for Structuring NNU Dataset .................................... 89 

Appendix G: Query Decomposition Agent's Prompt ...................................................... 90 

Appendix H: Query Classification Agent's Prompt ........................................................ 92 

Appendix I: Candidate Answer Agent Prompt ............................................................... 95 

Appendix J: Context Relevance Prompt ......................................................................... 96 

Appendix K: Answer Generation Agent Prompt ............................................................ 99 

Appendix L: Evidence Extraction Agent Prompt ......................................................... 101 

Appendix M: Irrelevant Query Response Agent Prompt .............................................. 104 

Appendix N: Islamic Fatwa with No Context (Baseline Evaluation) ........................... 106 

Appendix O: Prompt for Generating Context-based Answers ...................................... 107 

Appendix P: PDR Evaluation ....................................................................................... 108 

Appendix Q: Retrieval with Ranking Model ................................................................ 109 

Appendix R: HyDE ....................................................................................................... 110 

Appendix S: Query to Question ..................................................................................... 111 

Appendix T: Query to Topic ......................................................................................... 112 

Appendix U: Hybrid Retriever ..................................................................................... 113 

Appendix V: Ground Truth Evaluation for 100 Fatwas ................................................ 114 

Appendix W: NNU Baseline ......................................................................................... 115 

Appendix X: NNU Hybrid Retrieval ............................................................................ 116 

Appendix Y: Experimentation Download Link ............................................................ 117 

 


 
Abstract 

Question answering (QA) systems are essential tools in natural language processing 

(NLP), designed to interpret user queries and generate relevant answers. These systems 

have evolved over time from rule-based models to advanced machine-learning-based 

approaches. The emergence of the Transformer architecture and Large Language Models 

(LLMs) has set the stage for modern QA systems. 

LLMs have transformed QA by leveraging vast datasets and their ability to understand 

complex linguistic patterns to generate human-like responses across various domains. 

However, LLMs often generate plausible but incorrect answers, particularly in 

specialized domains like law and religion where accuracy is critical. This phenomenon is 

known as “hallucination”. The risk of hallucination increases when dealing with a 

complex language like Arabic. The Arabic language, with its rich morphology, diverse 

dialects, and dependency on diacritics, presents significant challenges for LLMs 

primarily trained on Western languages. 

Fine-tuning LLMs for domain-specific tasks is time-intensive and computationally 

expensive given their massive parameter counts, demanding innovative approaches to 

mitigate LLM hallucination without extensive re-training. 

This thesis introduces Nebras, a generic multi-domain QA system leveraging a Retrieval-

Augmented Generation (RAG) framework, LLM agents, and a hybrid retrieval approach. 

Nebras’s knowledge base can be dynamically extended by following simple guidelines 

and using its built-in mapping component, enabling it to adapt to any textual dataset. By 

employing an Agentic RAG pipeline, Nebras optimizes each processing stage using 

specialized agents. Furthermore, it utilizes pre-trained LLMs without fine-tuning, 

enhancing scalability and reducing computational costs. 



Experimentation results demonstrated Nebras’s performance in Arabic domain-specific 

QA. In the Islamic fatwa domain, it achieved a BERTScore-F1 of 70.94%, a METEOR 

of 13.49%, with 9 accepted fatwas compared to only 7 for GPT-4o. In the 

university help-desk domain, Nebras achieved a BERTScore-F1 of 75.80%, METEOR of 

40.20%, and BLEU of 9%, well above GPT-4o's BLEU score of 2.3%. 

These results highlight Nebras's ability to enhance factual accuracy, confirming 

its potential as a scalable Arabic QA solution. 

Keywords: NEBRAS, Multi-domain Arabic question-answering, Agentic Retrieval-

Augmented Generation, Large Language Models



Chapter One 

Introduction and Theoretical Background 

Research in language modeling began in the early 1950s with Shannon's work exploring 

the predictive and compressive capabilities of simple n-gram models on natural language 

[1]. Over time, statistical language modeling became a foundational technique in various 

natural language understanding and generation tasks, such as speech recognition, 

information retrieval, and question answering [2]. 

The introduction of neural network-based approaches represented a paradigm shift in 

language modeling, moving beyond traditional statistical methods [3]. Early deep 

learning approaches laid the groundwork for future progress by enabling the capture of 

more complex textual patterns and contextual dependencies [4]. This advancement led to 

the development of transformer architectures which revolutionized the field by enabling 

the effective handling of long-range dependencies and parallel processing [5]. These 

advancements marked a new era of NLP and paved the way for the emergence of 

LLMs [6]. 

LLMs have significantly advanced NLP by demonstrating their ability to understand and 

generate human-like text [7]. Their scale, architecture, and training on extensive 

and diverse datasets enable them to capture hidden and complex patterns in language 

[8]. These capabilities allow the models to achieve state-of-the-art performance across 

a variety of NLP tasks, including but not limited to text summarization, sentiment 

analysis, machine translation, and question-answering [9]. 

However, LLM capabilities are often challenged by the complexities of languages such as Arabic 

[10,11]. Unlike most Western languages, Arabic is characterized by its rich and complex 

morphology, its dependence on diacritics, and its multiple verb variations and noun forms 

[12]. This complexity makes accurate language processing a challenge. Diacritical marks 

(harakat) are omitted in most standard written Arabic texts, such as newspapers, 

books, and other general writing. These marks indicate short vowels and certain 

grammatical features, and their absence can introduce ambiguity in word meaning and 

pronunciation [13]. This ambiguity arises because many Arabic words are written as 

skeletons without diacritics, making the intended word dependent on context. For 

example, the undiacritized form "كتب" (k-t-b) can mean "he wrote" (kataba) or "books" (kutub), 



depending on the diacritics and context. This ambiguity becomes even more noticeable 

when paired with the wide variety of Arabic dialects, which often differ significantly from 

Modern Standard Arabic in vocabulary, syntax, and pronunciation [11]. Combined with 

the lack of high-quality Arabic datasets, these factors lead to weaker performance and 

frequent misunderstandings in Arabic-language tasks compared to languages with richer 

datasets [14]. 
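
The ambiguity described above can be reproduced in a few lines. The following is a minimal sketch, assuming that removing Unicode combining marks is an acceptable stand-in for how diacritics are dropped in everyday written Arabic; the helper name `strip_diacritics` is illustrative and not part of Nebras.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (category Mn), which include the Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Two fully vocalized words written on the same consonant skeleton k-t-b.
# Escapes are used so the diacritics survive copy and paste.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
kutub = "\u0643\u064F\u062A\u064F\u0628"          # kutub, "books"

bare_kataba = strip_diacritics(kataba)
bare_kutub = strip_diacritics(kutub)

# With diacritics the two words are different strings;
# without them they collapse into the same three letters.
print(kataba != kutub)
print(bare_kataba == bare_kutub)
```

Once the marks are removed, a model receives identical input for both words and must rely on the surrounding context to choose the intended reading.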

In the field of QA, LLMs typically rely on the knowledge acquired during training and 

stored within their parameterized memory to generate responses [15]. This reliance exposes 

them to knowledge gaps in specialized domains and to difficulties in processing 

diverse languages, increasing the risk of generating irrelevant or factually incorrect 

answers, a phenomenon commonly referred to as hallucination [16,17]. Moreover, the 

inherent complexity of natural language often results in models misinterpreting questions, 

leading to responses that are either irrelevant or incorrect [18]. For instance, when posed 

with the Arabic question: 

 ما هو معدل القبول لتخصص المحاسبة في جامعة النجاح الوطنية؟

(English translation: What is the required GPA for the Accounting major at An-Najah 

National University?) 

Google's Gemini 1.5 Pro did not provide a direct answer about the minimum required GPA; 

instead, it explained why academic majors do not have a fixed minimum required GPA at 

An-Najah National University, resulting in an irrelevant response. The generated response 

is detailed in Appendix A. 

When OpenAI’s GPT-4o was presented with the same question, it demonstrated a clear 

understanding of the query but incorrectly identified the minimum required GPA for 

admission as being between 80% and 85%, whereas in 2024 the actual minimum GPA for 

admission to the Accounting major at An-Najah National University (NNU) is 65%. The 

complete response from GPT-4o can be found in Appendix B. 

These challenges intensify in specialized domains, 

particularly in sensitive areas like legal counseling and religion, which often require deep 

domain knowledge and reasoning, as well as access to information that may not 

have been available to the LLM during its training [19]. 



In the Islamic religious context, for example, Fatwas serve as a crucial guide for Muslims on 

legal, ethical, and religious matters. Fatwas are formal statements issued by qualified 

Islamic scholars (Muftis) after analyzing Quranic verses, Hadith narrations (teachings 

attributed to the Prophet Muhammad), and the established reasoning of past Islamic 

scholars and Imams to establish an answer for a specific case [20]. The significance of 

Fatwas lies in their ability to translate broad Islamic principles into practical guidance for 

individuals facing unique situations. They offer clarification on permitted actions, ethical 

dilemmas, and religious obligations in a constantly evolving world. Therefore, inaccurate 

or misleading Fatwas can have serious religious and legal repercussions for the user. 

Moreover, Islamic Fatwas can differ among scholars due to variations in jurisprudential 

schools of thought (madhahib), resulting in a diversity of perspectives on the same 

issue [21,22]. 

These factors highlight the importance of accurate and reliable resources when seeking 

guidance through Islamic Fatwas. Accurately understanding the context of Quranic verses and 

Hadith narrations, and reasoning within Islamic texts, is crucial for navigating the complexities 

of Islamic law [11]. Current LLMs might struggle with this depth of domain-specific 

knowledge. For example, Google's Gemini 1.5 Pro and OpenAI's GPT-4o were asked the 

following Arabic fatwa question: 

 حكم الترويج لمنتجات لشركة مقابل عمولة مع شرط مسبق على المرّوج بدفع مبلغ مالي 

(English translation: What is the Islamic Fatwa ruling on promoting a company's 

products in exchange for a commission, with a prior condition requiring the promoter to 

pay a sum of money?) 

Both the Gemini and GPT-4o responses presented multiple rulings for various scenarios, 

deeming the practice جائز (permissible) in some cases, whereas the Mufti's response explicitly 

declared it حرام (forbidden). The detailed responses from both Google's Gemini 1.5 Pro 

and OpenAI's GPT-4o, along with the Mufti's ground truth answer, are provided in 

Appendix C (C.1, C.2, and C.3). 

In contrast, academic domains often involve factual queries requiring precision, clarity, 

and contextual understanding, such as university policies, course requirements, or 

research guidelines. These inquiries are typically based on carefully crafted institutional 

policies and thoroughly reviewed documentation [23,24]. Specialized academic domains 



pose unique challenges for LLMs due to the diversity, complexity, and constant evolution 

of educational systems and their policies. Research highlights that LLMs struggle with 

reasoning tasks requiring specialized knowledge, which necessitates domain-specific 

data acquisition and fine-tuning [25]. The dynamic nature of academic content 

and institutional guidelines further complicates matters, as static training data risks 

producing outdated or biased responses in such domains [26,27]. 

Fine-tuning is one approach to mitigate LLM hallucinations, either by expanding the 

model's knowledge or enhancing its linguistic capabilities. However, updating a model's 

parameterized memory through fine-tuning is challenging, as there is no established 

method for directly overwriting existing knowledge, leading to potential knowledge 

conflicts, which remain an ongoing research area [28,29]. Expanding a model's language capabilities 

within a specific domain typically involves fine-tuning on data relevant to that domain. 

If the model lacks initial support for the target language, pre-training on a large corpus in 

that language is usually necessary before domain-specific fine-tuning [30]. Furthermore, 

given the large number of parameters in LLMs, fine-tuning demands significant 

computational resources, making it a time-consuming and expensive process. 

Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank 

Adaptation (LoRA) or adapter modules, enable targeted updates without 

full fine-tuning of the model, which mitigates some of these challenges. However, they require careful 

hyperparameter tuning, and poorly chosen values can degrade performance [35]. 

An alternative approach to mitigating hallucination in LLMs is RAG. RAG integrates a 

language model with an external retriever that fetches relevant documents from a 

knowledge base, enabling the model to generate accurate responses by referencing up-to-

date information [19]. This method can be more efficient than fine-tuning because it 

avoids altering the model's internal parameters, reducing computational costs, and 

enabling dynamic updates to the knowledge base [32]. However, RAG has its challenges, 

including ensuring retrieval accuracy, managing outdated or conflicting information, and 

maintaining retrieval latency at scale [33]. Additionally, applying RAG to languages like 

Arabic requires effective retrieval mechanisms for morphologically rich languages [34]. 

Recent advancements in RAG, such as the incorporation of intelligent agents, have 

improved retrieval and synthesis processes, addressing limitations like retrieval accuracy 

and latency while expanding the system's capabilities [35–37]. 
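
The retrieve-then-generate idea behind RAG can be sketched as follows. This is a minimal illustration under stated assumptions, not the Nebras implementation: retrieval uses a bag-of-words cosine similarity instead of a neural embedding model, the knowledge base is a two-sentence toy list (the second sentence is made up), and the final call to an LLM is omitted, so the sketch stops at the augmented prompt.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, knowledge_base: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        knowledge_base,
        key=lambda doc: cosine(q, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Augment the question with retrieved context before generation."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Toy knowledge base; updating it requires no change to model parameters.
knowledge_base = [
    "The minimum admission GPA for the Accounting major in 2024 is 65%.",
    "The library is open from 8 am to 6 pm on weekdays.",  # made-up filler
]
query = "What is the minimum GPA for the Accounting major?"
passages = retrieve(query, knowledge_base, k=1)
prompt = build_prompt(query, passages)
print(prompt)
```

Because the fact is supplied in the prompt rather than stored in the model weights, correcting or extending the knowledge base only means editing the list of passages.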



LLM-based agents leverage the models' extensive reasoning capabilities to autonomously perform 

tasks by interpreting complex instructions and executing multi-step processes, acting as 

intermediaries between users and computational resources [38,39]. These agents 

demonstrate versatility across diverse domains such as social and natural sciences and 

engineering, but challenges like maintaining consistent performance, accurate contextual 

understanding, and seamless tool integration persist [38,40].  

Traditional RAG systems lack decision-making and validation mechanisms. In contrast, 

Agentic RAG integrates intelligent agents to dynamically select, process, and validate 

relevant information, enhancing response accuracy, contextual relevance, and robustness 

[15,35,36]. In addition, incorporating agents into the 

RAG pipeline makes it possible to handle complex, multi-step 

reasoning tasks more effectively and deliver contextually appropriate responses 

[37,41]. 
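
The difference between plain retrieval and an agent that validates what was retrieved can be illustrated with a small sketch. It is hypothetical throughout: `relevance_agent` is a word-overlap stub standing in for an LLM call, the three passages are toy data, and the control flow is a simplified illustration rather than the Nebras pipeline described in Chapter Two.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, corpus: list[str]) -> list[str]:
    """Plain retrieval: every passage sharing at least one word with the query."""
    return [p for p in corpus if words(query) & words(p)]


def relevance_agent(query: str, passage: str) -> bool:
    """Stub validator; a real agent would prompt an LLM to judge relevance."""
    return len(words(query) & words(passage)) >= 2


def agentic_rag(query: str, corpus: list[str], validate=relevance_agent) -> dict:
    """Retrieve, validate each candidate, then decide whether to answer."""
    candidates = retrieve(query, corpus)
    validated = [p for p in candidates if validate(query, p)]
    if not validated:
        # Decision step: decline rather than generate from weak context.
        return {"action": "decline", "context": []}
    return {"action": "answer", "context": validated}


corpus = [
    "Fatwas are issued by qualified muftis.",
    "The accounting major requires a minimum GPA.",
    "The library opens daily.",  # retrieved by word overlap, rejected by the validator
]

answer = agentic_rag("What is the minimum GPA for accounting?", corpus)
declined = agentic_rag("How tall is Mount Everest?", corpus)
print(answer)
print(declined)
```

The validation and decline steps are what a traditional retrieve-and-generate loop lacks: without them, the library passage would be passed to the generator as context.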

This research introduces Nebras, a novel multi-domain Arabic question-answering 

system designed for adaptability across various domains. Unlike traditional models that 

require domain-specific fine-tuning, Nebras processes textual datasets provided by users 

to expand its domain coverage. This capability allows the system to incorporate new 

datasets provided by administrators without the need for costly and time-consuming re-

training processes. Nebras leverages a hybrid retrieval approach that combines multiple 

techniques to ensure relevant information retrieval from large-scale corpora. It employs 

an Agentic RAG pipeline that consists of multiple agents, each performing a specific task 

in the system’s proposed pipeline. These agents work in coordination with pre-trained 

LLMs, enabling Nebras to deliver accurate, factually grounded, and context-aware 

responses. This design ensures that Nebras remains flexible, scalable, and capable of 

addressing diverse Arabic language applications across various sectors. 

1.1 Theoretical Background 

This section presents an overview of the theoretical concepts underlying question-answering 

systems and recent advancements in the field. 

1.1.1 Question Answering 

QA is a core task within the field of NLP, which deals with enabling machines to 

understand and process human language. QA systems are designed to answer questions 



posed by users in natural language, aiming to provide relevant, accurate answers by 

analyzing the input query and generating appropriate responses [42,43]. 

Based on their reliance on different data structures, QA systems can be categorized into 

text-based and knowledge-based [44]. Knowledge-based QA utilizes structured 

knowledge bases (KBs) that store information as triples in the format (subject, predicate, 

object) [45]. These systems answer questions by querying single or multi-relation facts in 

the KB, with single-relation questions relying on one fact and multi-relation questions 

requiring reasoning over multiple connected facts [46,47]. 
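
A minimal sketch of the two question types follows, assuming a knowledge base held as an in-memory list of triples. The predicate name `located_in` and both helper functions are illustrative; production systems typically use a graph store and a query language instead.

```python
Triple = tuple[str, str, str]

# Toy knowledge base of (subject, predicate, object) triples.
KB: list[Triple] = [
    ("An-Najah National University", "located_in", "Nablus"),
    ("Nablus", "located_in", "Palestine"),
]


def query_single(kb: list[Triple], subject: str, predicate: str) -> list[str]:
    """Single-relation question: answered by one matching fact."""
    return [o for s, p, o in kb if s == subject and p == predicate]


def query_chain(kb: list[Triple], subject: str, predicates: list[str]) -> list[str]:
    """Multi-relation question: follow a chain of predicates across connected facts."""
    frontier = [subject]
    for predicate in predicates:
        frontier = [o for node in frontier for o in query_single(kb, node, predicate)]
    return frontier


# "In which city is the university located?" needs one fact.
city = query_single(KB, "An-Najah National University", "located_in")

# "In which country is the university located?" needs two connected facts.
country = query_chain(KB, "An-Najah National University", ["located_in", "located_in"])

print(city, country)
```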

In contrast, text-based QA retrieves answers from unstructured text, such as documents 

or articles, by identifying and extracting the most relevant passages that match the query 

[45]. These systems typically follow a three-step pipeline [43]:  

1. Question processing: involves query formulation and answer type detection using 

classifiers. 

2. Document and passage retrieval: employs information retrieval (IR) models to 

extract relevant text segments. 

3. Answer extraction: measures the similarity between the query and 

candidate answers to determine the most appropriate response. 
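
The three steps above can be sketched end to end. The example is deliberately simplistic and rests on several assumptions: a first-word rule stands in for the answer type classifier, word overlap stands in for both the IR model and the similarity measure, and the stopword list and passages are toy data.

```python
import re

STOPWORDS = {"what", "who", "when", "is", "are", "the", "a", "an", "of", "in", "for", "to"}


def tokens(text: str) -> list[str]:
    """Lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def process_question(question: str) -> tuple[set[str], str]:
    """Step 1: formulate query terms and guess the expected answer type."""
    toks = tokens(question)
    answer_type = {"who": "PERSON", "when": "DATE"}.get(toks[0] if toks else "", "OTHER")
    return {t for t in toks if t not in STOPWORDS}, answer_type


def retrieve_passages(terms: set[str], passages: list[str], k: int = 1) -> list[str]:
    """Step 2: rank passages by the number of query terms they contain."""
    return sorted(passages, key=lambda p: len(terms & set(tokens(p))), reverse=True)[:k]


def extract_answer(terms: set[str], passage: str) -> str:
    """Step 3: pick the sentence with the highest overlap with the query."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    return max(sentences, key=lambda s: len(terms & set(tokens(s))))


passages = [
    "Nebras is a multi-domain Arabic question-answering system. "
    "It uses a Retrieval-Augmented Generation framework.",
    "Fatwas are formal statements issued by qualified Islamic scholars. "
    "They guide Muslims on legal, ethical, and religious matters.",
]

question = "Who issues fatwas?"
terms, answer_type = process_question(question)
top_passage = retrieve_passages(terms, passages, k=1)[0]
answer = extract_answer(terms, top_passage)
print(answer_type, "|", answer)
```

Note that exact word matching misses the link between "issues" and "issued"; this is the kind of gap that stemming, or the neural matching models discussed next, is meant to close.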

Advanced deep neural models are often employed to enhance text-based QA by 

accurately matching questions with potential answers [44,45]. 

1.1.2 Large Language Models 

LLMs are computational systems designed to understand and generate human language 

by leveraging statistical methods to predict word sequences or generate responses based 

on input [48]. These models achieve remarkable performance in tasks such as text 

generation, translation, and summarization (to name a few) due to their large-scale 

training on extensive datasets and their implementation of the Transformer architecture 

[49]. Central to this architecture is the self-attention mechanism, which enables efficient 

parallel processing and assigns varying importance to input tokens. This allows the model 

to capture long-range dependencies effectively [5]. Transformers have powered state-of-

the-art language models like Google's BERT [6], Facebook's RoBERTa [50], and 

OpenAI's GPT-3 and GPT-4 [8,51]. 
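
The self-attention mechanism can be made concrete with a small numerical sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, as defined in [5]. As a simplifying assumption, the same toy vectors are used for queries, keys, and values; real models obtain them from learned projections and run several attention heads in parallel.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)  # importance assigned to each input token
        weights.append(w)
        # Output is the weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


# Three toy token vectors of dimension 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X, X, X)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to one and shows how much one token attends to every other token, regardless of their distance in the sequence. Every row is computed independently, which is what permits parallel processing.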

 

The massive size of LLMs, billions to trillions of parameters (hence the name "Large"), 

is a critical factor in their performance. This scale enables LLMs to learn complex 

language patterns and develop sophisticated linguistic abilities [49]. Moreover, LLMs 

have demonstrated the emergence of novel capabilities, such as in-context learning [52]. 

This ability allows the models to adapt to specific tasks and generate contextually relevant 

responses, making them well-suited for a wide range of applications, including dialogue 

systems [48], step-by-step reasoning [53], and even processing non-textual input such as images 

and audio, as in Multi-modal Large Language Models such as Google's PaLM [54]. 

As knowledgeable and plausible-sounding as they are, LLM-generated responses can be 

nonsensical or factually incorrect, and therefore cannot always be trusted. This phenomenon is 

common in LLMs and is known as "hallucination" [55,56].  

LLMs typically rely on the knowledge they learned during the training process and stored 

in their parameterized memory [15]. Hallucinations often arise from models attempting 

to fill gaps in knowledge by generating responses based on probabilistic associations 

rather than verifiable knowledge [24]. The hallucination problem becomes particularly 

concerning in specialized domains, such as the legal or medical fields, where incorrect

information can have serious consequences and can mislead users, undermining trust in 

LLM-generated responses. Even with their enhanced capabilities in understanding, 

reasoning, and generation, Multi-modal LLMs are not immune to generating hallucinated

content that may appear plausible [57]. 

One established approach for interacting with LLMs is prompt engineering. This 

technique involves crafting specific textual prompts to guide the LLM's response 

generation process [62]. Users can steer the LLM's output towards desired outcomes and

tasks by carefully designing these prompts. 

1.1.3 Prompt Engineering 

Prompt engineering enables LLMs to perform a wide array of tasks without requiring 

retraining or fine-tuning. Practitioners can guide LLMs toward generating contextually

relevant and accurate outputs by designing the input prompts thoughtfully, thereby

enhancing the LLM's performance [63].


Prompt engineering involves crafting specific instructions or queries (prompts) that 

encourage the model to generate responses aligned with the user's goal. Recent studies 

have expanded the landscape of prompt engineering by exploring methods such as zero-

shot and few-shot prompting [60,61]. Zero-shot prompting allows LLMs to perform new 

tasks without any task-specific training by relying entirely on the model’s pre-trained 

knowledge base [61]. This technique is widely used for large models like GPT-3 where a 

well-structured prompt can enable the model to perform tasks it has never encountered 

before [51]. On the other hand, few-shot prompting has shown improvements in handling 

more complex tasks by providing the model with a few example inputs and outputs, even 

with minimal additional input data [62]. 
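
The difference between zero-shot and few-shot prompting can be sketched as plain string construction. The `build_prompt` helper, the `Input:`/`Output:` layout, and the sentiment task are assumptions chosen for illustration; they are not taken from the cited works.

```python
def build_prompt(task, query, examples=None):
    """Return a zero-shot prompt, or a few-shot prompt when examples are given."""
    lines = [task]
    # Few-shot: show the model a handful of solved input/output pairs first.
    for example_input, example_output in examples or []:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
    # The actual query always comes last, with the output left open.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


task = "Classify the sentiment of the input as positive or negative."
zero_shot = build_prompt(task, "The service was excellent.")
few_shot = build_prompt(
    task,
    "The service was excellent.",
    examples=[("I loved the food.", "positive"), ("The room was dirty.", "negative")],
)
```

In both cases the model's parameters are untouched; only the text placed before the query changes.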

A notable advancement in prompt engineering is Chain-of-Thought (CoT) prompting 

which was introduced by Wei et al. in 2022 [53]. CoT prompts guide the model through 

logical steps, enhancing its ability to process and produce logical, reasoned outputs, 

which makes it particularly useful for complex reasoning tasks such as mathematical 

problem solving and commonsense reasoning. Further improvements to this approach, 

such as Auto-CoT, automate the generation of reasoning chains, thereby enhancing 

robustness and reducing human effort in creating example-based prompts [63]. 

The introduction of role-prompting has also improved the specificity of model outputs by 

assigning a "role" to the model in the prompt, such as "acting as an expert" or "a friendly 

assistant". This helps guide the model towards more contextually appropriate and accurate 

responses in various domains [64]. 
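
The role-prompting and Chain-of-Thought ideas above can likewise be sketched as prompt templates. The helper names and wording are hypothetical; the worked example follows the few-shot CoT style of showing the reasoning before the answer, and the closing cue "Let's think step by step." is a commonly used zero-shot variant rather than part of the original CoT formulation.

```python
def role_prompt(role, question):
    """Prefix the question with a role assignment."""
    return f"You are {role}.\n{question}"


def chain_of_thought_prompt(question, worked_example=None):
    """Ask for intermediate reasoning, optionally after one worked example."""
    parts = []
    if worked_example is not None:
        ex_question, ex_reasoning, ex_answer = worked_example
        parts.append(f"Q: {ex_question}\nA: {ex_reasoning} The answer is {ex_answer}.")
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)


example = (
    "A shop has 3 boxes with 4 pens each. How many pens are there?",
    "Each box holds 4 pens and there are 3 boxes, so 3 * 4 = 12.",
    "12",
)
cot = chain_of_thought_prompt(
    "A class has 5 rows of 6 seats. How many seats are there?", example
)
expert = role_prompt("a careful legal assistant", "Summarize the tenant's obligations.")
```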

Despite these advancements, challenges persist in optimizing prompts for more complex 

tasks due to the influence of multiple factors, including task complexity, model biases, 

and token limitations [65,66]. 

1.1.4 Retrieval Augmented Generation 

RAG is a method that addresses factual hallucination and limitations in domain-specific 

knowledge for LLMs by incorporating external knowledge through information retrieval 

[32]. As illustrated in Figure 1, a traditional RAG-based QA system consists of three 

steps [23]:

1. Indexing: storing vector representations (embeddings) of text chunks. These 

embeddings allow efficient retrieval of relevant text based on similarity and are often


stored in a vector database. Vector databases are specialized data management 

systems designed for storing, indexing, and querying high-dimensional vectors. They 

support similarity search by using nearest-neighbor search algorithms, enabling 

efficient retrieval of semantically related data [67]. Unlike traditional relational 

databases, which organize data in tables with rows and columns and rely on structured 

query languages (SQL), vector databases handle unstructured data by representing it 

as numerical vectors. While regular databases perform exact-match searches, vector 

databases perform approximate nearest-neighbor searches. 

2. Retrieval: identifying the most relevant text chunks for a given question. It uses the

same embedding model as the vector storage to find similar text based on the question 

embedding. 

3. Generation: generating answers from the retrieved text segments using a language 

model. 

The traditional RAG (also known as Naive RAG) adopts the Retrieve-Read method [68]

which takes the user's query, matches it against indexed documents, then retrieves the

k most relevant documents.

Figure 1 

Traditional RAG Pipeline 
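
The three steps of the traditional pipeline can be sketched in a few lines. This is a toy under stated assumptions: a bag-of-words count vector stands in for a neural embedding model, a Python list stands in for the vector database, an exhaustive scan replaces approximate nearest-neighbor search, and the generation step stops at building the prompt that would be sent to a language model. The example chunks are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a neural embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Indexing: store each chunk next to its embedding."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=2):
    """Retrieval: embed the question with the same model and take the top k."""
    query_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, contexts):
    """Generation input: the retrieved chunks are placed before the question."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context_block}\nQuestion: {question}"
    )


chunks = [
    "the central library opens at nine in the morning",
    "tuition fees are paid at the start of each semester",
    "the exam schedule is published two weeks before finals",
]
index = build_index(chunks)
question = "when does the library open"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
```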

 
The Naive RAG approach faces significant drawbacks across its retrieval, generation, and 

augmentation stages. During retrieval, it struggles with selecting relevant or well-aligned 

content and may fail to retrieve essential information, leading to misaligned (inaccurate) 


responses [76]. In the generation phase, the model often produces hallucinated or 

unsupported content, which compromises response quality and reliability. Additionally, 

issues of irrelevance in generated outputs can further reduce effectiveness [19,69]. The 

augmentation process presents challenges in addressing coverage errors and ensuring 

retrieved content aligns effectively with task-specific requirements [71]. Moreover, 

generation models can produce repetitive outputs or fail to integrate retrieved content 

meaningfully, resulting in responses that reflect surface-level repetition rather than deep 

contextual understanding [19]. To address these drawbacks, several modifications have 

been introduced: 

Index Structuring 

A structural index enhances retrieval in RAG systems by organizing documents 

hierarchically, creating multilevel parent-child relationships among document chunks 

instead of indexing text chunks independently without any relationships or links between 

them. This method is called Hierarchical Structure Indexing [19], where each child

document stores its parent document ID so that the entire parent document can be retrieved instead of

the text chunk, reducing issues associated with redundant or contextually disjointed

chunks. Figure 2 illustrates a two-level Hierarchical Indexing for a single document. 

Figure 2  

Two-Level Hierarchical Structure Indexing for One Document 
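
A minimal sketch of the two-level parent-child idea follows. The document texts, the six-word child size, and the word-overlap score are illustrative assumptions; a real system would score child chunks with embeddings.

```python
def split_into_children(parent_id, text, size=6):
    """Cut a parent document into child chunks that remember their parent."""
    words = text.split()
    return [
        {"parent_id": parent_id, "text": " ".join(words[i:i + size])}
        for i in range(0, len(words), size)
    ]


def overlap(a, b):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


parents = {
    "doc-1": "The central library opens at nine in the morning and closes "
             "at six in the evening on weekdays",
    "doc-2": "Tuition fees are paid at the start of each semester through "
             "the online student portal",
}
children = [
    child
    for parent_id, text in parents.items()
    for child in split_into_children(parent_id, text)
]


def retrieve_parent(question):
    """Match on the small child chunk, but return the whole parent document."""
    best_child = max(children, key=lambda c: overlap(question, c["text"]))
    return parents[best_child["parent_id"]]


result = retrieve_parent("when does the library open")
```

Matching happens on small chunks, which keeps the similarity signal focused, while the language model receives the full parent, including the closing time that the matched child alone does not contain.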

 
Another indexing structure method is Knowledge Graph Structure Indexing, where 

document chunks are represented as nodes, and the relations between chunks are

represented as edges. Adding a knowledge graph (KG) index further strengthens this 


structure by linking concepts and entities within the documents. This approach not only 

minimizes errors in retrieval but also translates the process into steps the LLM can 

interpret, leading to more accurate and contextually relevant responses. Methods like 

Knowledge Graph Prompting (KGP) [72] use KGs to represent document sections as 

nodes (such as pages or tables) and their connections as edges. This representation

captures semantic relationships and enables coherent knowledge retrieval and

reasoning across multiple documents [19]. 
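
A graph index of this kind can be sketched with an adjacency list: once a seed node has been retrieved, its neighbors are pulled in as additional context. The node names and edges below are invented, and the breadth-first expansion is one simple traversal strategy, not the procedure used by KGP.

```python
from collections import deque


def neighbors_within(graph, start, max_hops=1):
    """Return the start node plus every node reachable within max_hops edges."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return sorted(hops)


# Nodes are document sections; edges link related sections (made-up example).
graph = {
    "page-1": ["page-2", "table-1"],
    "page-2": ["page-1", "page-3"],
    "table-1": ["page-1"],
    "page-3": ["page-2"],
}
one_hop = neighbors_within(graph, "page-1", max_hops=1)
two_hops = neighbors_within(graph, "page-1", max_hops=2)
```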

Chunking Optimization 

Chunking in RAG is essential for efficient and accurate query answering. It balances 

providing enough relevant context with minimizing irrelevant data, thus improving 

retrieval quality and computational performance [73]. Fixed chunk sizes are often used in 

RAG pipelines and can sometimes lead to insufficient or excessive information within 

each chunk. Techniques such as recursive chunking and sliding windows address this 

issue by segmenting content based on natural language structures, like punctuation and 

sentence boundaries, or by overlapping chunks to preserve coherence [74]. For 

documents with clear structures, like financial reports, more advanced chunking methods 

like element-based chunking provide a tailored approach by using structural cues like

titles and tables to create chunks, leading to more accurate retrieval [74]. While semantic 

and agentic chunking strategies provide improved contextual alignment, their increased 

complexity and computational requirements highlight the trade-off between retrieval 

accuracy and processing efficiency [73]. 
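
Two of the simpler strategies mentioned above can be sketched as follows: a sliding window with overlap, and a sentence-boundary chunker. Both are simplified illustrations; the function names and size limits are assumptions, and the sentence chunker is a one-level stand-in for recursive chunking, which tries several separators in turn.

```python
import re


def sliding_window_chunks(words, size, overlap):
    """Fixed-size chunks in which consecutive chunks share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end
    return chunks


def sentence_chunks(text, max_words):
    """Group whole sentences into chunks of at most max_words where possible."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate.split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


windows = sliding_window_chunks(list("abcdefghij"), size=4, overlap=2)
grouped = sentence_chunks("One two three. Four five. Six seven eight nine.", max_words=5)
```

The window variant trades extra storage (repeated words) for coherence across chunk borders, whereas the sentence variant never cuts a sentence in half but produces chunks of uneven length.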

Metadata Attachment 

Incorporating metadata into document chunks can contribute to enhancing retrieval 

performance in RAG systems, particularly in multi-document contexts [75]. Attaching 

metadata such as page numbers, document titles, authorship information, timestamps, and 

other relevant identifiers allows precise filtering and prioritization of the most recent information,

thereby improving retrieval relevance and minimizing the potential for confusion between 

similar chunks originating from distinct documents [76]. Furthermore, artificially

constructed metadata, like paragraph summaries or hypothetical questions generated by LLMs,

bridges the semantic gap between user queries and document content,

resulting in more accurate responses [75,76]. Incorporating metadata annotations also

adds contextual layers to each chunk, thereby improving the RAG system's capacity to

retrieve and present coherent information from diverse sources [19].
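
Metadata-based filtering and recency prioritization can be sketched as below. The field names (`title`, `page`, `updated`) and the sample chunks are hypothetical; vector databases typically expose comparable filters, but no particular product's API is implied.

```python
from datetime import date

chunks = [
    {"text": "Old fee schedule.", "title": "Fees", "page": 2, "updated": date(2021, 9, 1)},
    {"text": "Current fee schedule.", "title": "Fees", "page": 2, "updated": date(2024, 9, 1)},
    {"text": "Library rules.", "title": "Library", "page": 5, "updated": date(2023, 1, 15)},
]


def filter_and_prioritize(chunks, title=None):
    """Keep chunks from the requested document and list the newest first."""
    selected = [c for c in chunks if title is None or c["title"] == title]
    return sorted(selected, key=lambda c: c["updated"], reverse=True)


fee_chunks = filter_and_prioritize(chunks, title="Fees")
```

Here the two near-identical fee chunks are disambiguated by their timestamps, and the unrelated library chunk never reaches the language model.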


Re-ranking 

Re-ranking models are essential in RAG systems, as they refine document retrieval by 

applying a secondary prioritization based on relevance (secondary to the similarity score) 

[19]. By reorganizing retrieved chunks, re-ranking ensures that the most relevant content 

appears at the top, optimizing the document pool that is provided to the language model 

[19,76]. This prioritization process can be rule-based, relying on metrics such as 

relevance, diversity, and mean reciprocal rank (MRR), or it can employ model-based 

methods leveraging advanced natural language processing models [19,76]. 

Re-ranking addresses the limitations of initial retrieval methods, which often prioritize 

similarity (e.g., through cosine similarity scores) without fully assessing relevance 

[19,76]. Advanced re-ranking models, like cross-encoders, are particularly effective in 

accurately scoring chunk relevance for a given query, often outperforming simpler bi-

encoder models in this domain [19]. However, these approaches are computationally 

intensive, especially those that use pre-trained language models (PLMs) [19,77]. 

Generative LLMs, such as GPT-3, can further enhance re-ranking by generating synthetic 

queries for domain-specific training, enhancing the accuracy of the relevance-based 

ordering without requiring vast amounts of new labeled data [77]. Despite their 

performance, these re-ranking models are resource-intensive, emphasizing a trade-off 

between retrieval precision and computational cost [19]. 
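
The two-stage idea can be illustrated with toy scorers. The first stage below counts shared words, standing in for a similarity search; the second stage counts shared word pairs, standing in for a cross-encoder that reads the query and the chunk jointly. Both scorers and the sample texts are assumptions made for illustration, and mean reciprocal rank (MRR) is used here only to compare the two orderings.

```python
def unigram_overlap(query, doc):
    """First-stage score: shared words, ignoring word order."""
    return len(set(query.split()) & set(doc.split()))


def bigrams(text):
    words = text.split()
    return set(zip(words, words[1:]))


def joint_score(query, doc):
    """Second-stage score: shared adjacent word pairs (cross-encoder stand-in)."""
    return len(bigrams(query) & bigrams(doc))


def rank(query, candidates, scorer):
    return sorted(candidates, key=lambda doc: scorer(query, doc), reverse=True)


def mean_reciprocal_rank(rankings, relevant_docs):
    """Average of 1 / (position of the relevant document) over all queries."""
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_docs):
        for position, doc in enumerate(ranking, start=1):
            if doc == relevant:
                total += 1.0 / position
                break
    return total / len(rankings)


query = "library opening hours"
docs = [
    "hours of the library staff opening ceremony",
    "library opening hours on weekdays",
    "cafeteria menu",
]
first_stage = rank(query, docs, unigram_overlap)
reranked = rank(query, first_stage, joint_score)
mrr_before = mean_reciprocal_rank([first_stage], [docs[1]])
mrr_after = mean_reciprocal_rank([reranked], [docs[1]])
```

The first two documents contain the same query words, so the first stage cannot separate them; the second stage moves the relevant one to the top at the cost of an extra scoring pass over every candidate.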

Context Compression 

Context compression is a technique in RAG systems to optimize performance and reduce 

inference costs. RAG systems retrieve relevant documents from an external datastore to 

augment a language model's response, but incorporating full documents as context can 

quickly lead to excessive token usage, exceeding the model's context length limits and 

increasing processing time [78]. Instead of simply concatenating numerous documents, 

context compression selectively simplifies information to minimize noise and highlight 

essential data, allowing the language model to focus on the most relevant content [78]. 

Several strategies for context compression have proven effective. One method employs 

small language models (SLMs) to filter out less important tokens in order to create a 

compressed prompt. Although the compressed result might seem disjointed to humans, it 

remains interpretable by LLMs and achieves compression without requiring further LLM 


training. Other methods train information extractors to identify relevant content within 

large documents [19]. 

Combining document reduction with context compression further improves model 

accuracy [19]. The "Filter-Re-ranker" paradigm, for example, uses SLMs as filters and 

LLMs as re-rankers to prioritize relevant content. LLMs can also evaluate and critique 

retrieved content before generating a response, discarding irrelevant information and 

focusing on key details. These compression techniques are crucial for balancing 

relevance, token limits, and processing costs in RAG systems while maintaining language 

integrity [19]. 
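
As a rough illustration of compression under a token budget, the sketch below keeps only the sentences that share the most words with the question. The word-overlap score is an assumption standing in for the small language models or trained extractors mentioned above, and words are counted instead of model tokens.

```python
def compress_context(question, sentences, max_words):
    """Keep the most question-relevant sentences that fit into max_words."""
    question_words = set(question.lower().split())
    # Rank sentences by overlap with the question (ties keep document order).
    ranked = sorted(
        enumerate(sentences),
        key=lambda pair: len(question_words & set(pair[1].lower().split())),
        reverse=True,
    )
    kept, used = [], 0
    for position, sentence in ranked:
        length = len(sentence.split())
        if used + length <= max_words:
            kept.append(position)
            used += length
    # Restore the original order so the compressed context stays readable.
    return [sentences[i] for i in sorted(kept)]


sentences = [
    "The library opens at nine.",
    "The cafeteria serves lunch at noon.",
    "The library closes at six.",
]
compressed = compress_context("when does the library open and close", sentences, max_words=10)
```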

Query Expansion and Re-writing 

In RAG systems, query expansion enhances retrieval accuracy by adding relevant context 

to the input query [19]. Hypothetical Document Embeddings (HyDE) achieves this by 

using LLMs to generate hypothetical contexts, which are then embedded with the original 

query to improve retrieval precision (Figure 3). This method is particularly beneficial 

when limited labeled data or explicit knowledge is available, enabling RAG systems to 

create richer embeddings by incorporating potentially relevant details [79].

Figure 3  

Query Expansion Using HyDE 

 
However, query expansion using LLMs, including methods like HyDE, can sometimes 

introduce inaccuracies or hallucinations—where hypothetical content diverges from 

factual information—especially if the LLM lacks knowledge of the query topic. To 

address this, multi-query approaches expand the initial query into several targeted queries, 

while sub-query methods break down complex questions into simpler prompts [19]. 
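
The HyDE idea can be sketched with a toy retriever. Everything here is an assumption made for illustration: `fake_llm` returns a canned hypothetical passage instead of calling a model, word counts stand in for embeddings, and the chunks are invented. The point is only that the hypothetical passage shares vocabulary with the stored chunk even when the question does not.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a neural embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def fake_llm(question):
    """Hypothetical stand-in for an LLM call; returns a canned passage."""
    return "books can be borrowed at the library circulation desk during opening hours"


def retrieve(query_text, chunks):
    query_vec = embed(query_text)
    return max(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)))


chunks = [
    "students can buy lunch in a cafeteria",
    "the library circulation desk is open from nine to six",
]
question = "when can i borrow books"
plain_hit = retrieve(question, chunks)
# HyDE: embed the question together with the generated hypothetical passage.
hyde_hit = retrieve(question + " " + fake_llm(question), chunks)
```

If the generated passage were wrong, the same mechanism would pull retrieval toward the wrong chunk, which is the hallucination risk noted above.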


1.1.5 Agents in Large Language Models 

LLM agents are designed to autonomously perform tasks by leveraging the models' 

reasoning abilities [38]. These agents can interpret complex instructions and execute 

multi-step processes, acting as intermediaries between users and computational resources 

[39]. Agents have demonstrated their adaptability and potential for transformative impact in

various fields, including social sciences, natural sciences, and engineering [40]. However, 

challenges remain in maintaining consistent performance, ensuring accurate contextual 

understanding, and achieving seamless integration with external tools [38]. Ongoing 

research focuses on improving agents' autonomy, reliability, and human-like

interaction [80]. 

Agentic RAG leverages LLM agents to enhance the retrieval of information and 

overcome the limitations of traditional RAG systems [15] by incorporating intelligent agents

that dynamically select and process relevant information, thereby improving response 

accuracy and contextual relevance [35]. This approach enables the system to handle 

complex, multi-step reasoning tasks more effectively than naive RAG, which lacks such 

dynamic decision-making capabilities [37]. Furthermore, by leveraging agents with

access to various tools, Agentic RAG can route queries to specialized knowledge sources, 

leading to more accurate and contextually appropriate responses [36]. In contrast, naive 

RAG systems directly use retrieved information without additional processing, which 

may result in coherence issues in generated responses [15]. 
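
Query routing, the simplest form of the agentic behavior described above, can be sketched as follows. The tool names, the keyword rule, and the default choice are all hypothetical: in an Agentic RAG system the routing decision would come from an LLM, and this sketch does not describe the implementation of any cited system.

```python
def search_fatwa_kb(question):
    """Placeholder tool: would query a fatwa knowledge base."""
    return f"fatwa-kb:{question}"


def search_legal_kb(question):
    """Placeholder tool: would query a legal knowledge base."""
    return f"legal-kb:{question}"


TOOLS = {"fatwa": search_fatwa_kb, "legal": search_legal_kb}


def choose_tool(question):
    """Stand-in for the agent's LLM-driven decision: a simple keyword rule."""
    legal_terms = {"contract", "law", "court", "lease"}
    words = set(question.lower().replace("?", "").split())
    return "legal" if words & legal_terms else "fatwa"


def agentic_retrieve(question):
    """Pick a specialized source first, then retrieve from it."""
    tool_name = choose_tool(question)
    return tool_name, TOOLS[tool_name](question)


legal_route = agentic_retrieve("Is this lease contract valid?")
fatwa_route = agentic_retrieve("What is the ruling on fasting while travelling?")
```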

1.2 Literature Review 

QA systems research began in the early 1960s, marking a significant area of study within 

natural language processing [81]. Early efforts often adapted rule-based methodologies, 

exemplified by the system proposed by Riloff and Thelen in 2000 [82]. However, their 

work [82] highlighted several inherent limitations of this approach, including the 

resource-intensive nature of rule creation, sensitivity to variations in wording and 

sentence structure, and limited ability to draw inferences. Further challenges arise from 

coreference resolution, contextual interpretation, and ambiguity management. Scalability 

is also a constraint, and errors from earlier processing stages can cascade, negatively 

impacting overall system performance. The manual crafting required by rule-based 

systems complicates their maintenance and updates, and their inability to generalize can 

lead to performance decline with larger datasets [83]. 


In the pre-transformer era, machine learning and NLP techniques were combined in 

various models to address the challenges of rule-based systems. Poon et al. [84] advanced

machine reading by combining probabilistic reasoning with NLP to infer meaning from 

text, aiming to improve structured knowledge extraction from unstructured data using 

statistical methods. Lende and Raghuwanshi [85] similarly used NLP techniques like 

part-of-speech tagging, named entity recognition, and syntactic pattern matching to build 

question-answering systems for educational texts, focusing on interpreting and retrieving 

relevant information. 

In addition to machine learning-driven methods, IR techniques were employed to extract

relevant text segments from extensive document collections and thereby enhance

overall performance. Dwivedi and Singh [86] conducted a comprehensive review of

question-answering systems showing how IR systems employed different techniques 

such as document indexing, keyword matching, and ranking algorithms to locate relevant 

information. The field was further advanced by the development of probabilistic IR 

models and the introduction of statistical methods for relevance estimation [87]. Early 

efforts, such as vector space models [88] and latent semantic analysis [89], laid the 

groundwork for more sophisticated retrieval systems. Despite their contributions, these 

early systems were often constrained by their reliance on complex NLP pipelines that 

required accurate linguistic annotations and handcrafted rules. This dependency not only 

increased the computational overhead but also limited their scalability and adaptability 

across diverse domains and languages, as emphasized by Lende and Raghuwanshi [85]. 

Following the introduction of the Transformer architecture [5] and BERT

[6], several extractive question answering systems were proposed [90–93].

The methodologies [90–93] involved fine-tuning BERT or its variants (like RoBERTa) 

on domain-specific datasets, often with enhancements like hybrid architectures, semantic 

layers, or de-biasing strategies to address specific challenges in extractive QA. Most focus 

on leveraging transformer-based embeddings for efficient answer extraction. A common 

set of challenges emerges across research on extractive question answering using BERT 

and its variants. Many models exhibit domain and dataset dependence, performing well 

only on datasets similar to those they are fine-tuned on, which limits their generalization 

ability to diverse or unseen contexts [91,94]. Bias issues, such as position bias, cause 

models to overly rely on the location of answers within passages rather than their content 

[92]. Additionally, advanced architectures and mechanisms for improving performance 


often come with increased computational overhead, making them less practical for real-

world applications [90]. Language limitations are significant, particularly in low-resource 

languages, where models struggle due to insufficient pre-training data and linguistic 

complexity [91]. Finally, models often face challenges with contextual understanding, 

such as handling complex sentence structures or reasoning across multiple sentences, 

reducing their effectiveness in answering complex queries [90,93]. 

A methodology proposed by Alkhurayyif and Sait [95] for Arabic question answering 

involves data preprocessing, named entity relationship identification using the 

Multinomial Naïve Bayes algorithm and Named Entity Recognition (NER), and response 

retrieval leveraging ELMo embeddings and Quaternion Long Short-Term Memory 

(QLSTM) networks. However, the system faces several challenges, including sensitivity 

to real-world variability, difficulty handling complex Arabic sentence structures, limited 

support for non-textual queries, and inefficiencies in training time. The complex 

morphology of Arabic, particularly verb-noun patterns, hinders contextual understanding, 

and scalability remains a concern due to its design being tailored for Arabic-specific tasks. 

Building on the foundational advancements of BERT, LLMs emerged as a transformative 

innovation in NLP, characterized by their unprecedented scale and capability [96]. These 

models significantly enhanced their ability to understand and generate contextually 

relevant text, achieving remarkable performance across a wide array of NLP tasks [97]. 

The adoption of transformer-based architectures, as outlined by Gillioz et al. [98], laid 

the foundation for these advancements, enabling LLMs to handle tasks such as machine 

translation, text summarization, and sentiment analysis with unprecedented accuracy and 

scalability. The vast pre-training of LLMs on diverse datasets across multiple domains 

has provided the models with extensive general knowledge, making them highly effective 

for open-domain applications. Kamalloo et al. [99] demonstrated LLMs' ability to

provide contextually rich and accurate answers in open-domain question answering, 

leveraging their training on large datasets. However, despite their capabilities, LLMs are 

prone to hallucinations, particularly in specialized and closed-domain scenarios [100].  

To mitigate this, fine-tuning with domain-specific data has

been shown to enhance performance by improving knowledge representation and 

accuracy within those domains. For instance, Guo and Hua [101] employed continuous 


training and instruction fine-tuning to adapt Meta's Llama 2 base models to the Chinese 

medical domain, achieving performance comparable to OpenAI’s GPT-3.5-turbo in 

medical question answering. Similarly, Singhal et al. [102] proposed fine-tuning LLMs 

with specialized medical datasets to improve their ability to address expert-level medical 

inquiries accurately. Huang et al. [103] further demonstrated this principle by introducing 

"Lawyer LLaMA", a legal domain LLM that incorporates domain knowledge through 

continuous training and acquires professional skills via supervised fine-tuning, effectively 

adapting the model to legal-specific queries. Although these fine-tuning approaches enhance

domain accuracy, challenges remain, notably in supporting multilingual queries. Many

LLMs are initially trained on primarily English datasets, which may limit their ability to 

handle languages with less training data [104]. The research by Xu et al. [105] discusses 

significant limitations in multilingual LLMs, such as language imbalance and 

multilingual alignment, which can degrade performance in low-resource languages.

Additionally, LLMs require frequent updates to their knowledge base to keep up with 

rapid advancements and evolving information in specialized domains. Since the 

knowledge in an LLM is "learned" within its parameters, updating the model’s knowledge 

typically involves computationally expensive retraining or fine-tuning [72]. Moreover, LLMs

may struggle to integrate new factual knowledge effectively, instead reinforcing their pre-

existing knowledge, which can lead to increased hallucinations [106]. Chang et al. [107] 

highlight that adding multilingual data can improve low-resource language modeling 

performance, but as dataset sizes increase, adding more data may begin to decrease

performance due to limited model capacity, a phenomenon known as the "curse of multilinguality"

[107,108]. This limitation poses a significant challenge in making LLMs fully reliable and

adaptable across diverse and dynamic specialized domains. 

RAG has emerged as a promising alternative to traditional fine-tuning approaches for 

addressing the limitations of parameter-based language models [15]. While fine-tuning 

involves adapting a model by updating its parameters on specific datasets, it is often 

constrained by the static nature of learned knowledge and the risk of forgetting. In 

contrast, RAG dynamically integrates external knowledge sources, allowing models to 

access and leverage up-to-date information beyond their initial training data [15,19]. 

Several question-answering systems have been proposed using RAG, demonstrating its 

adaptability. For example, the English question-answering system DPR-RAG [109] 

integrates dense passage retrieval to enhance the accuracy of generated responses, 


offering significant improvements over traditional approaches. However, its reliance on 

dense embeddings can result in poor generalization to domains or languages with limited 

training data, and irrelevant passages retrieved during the process can degrade output 

quality. 

A study focusing on Islamic question answering is MufassirQAS [110], which employs

RAG to enhance Arabic question answering, particularly within the Islamic studies domain,

by integrating a vector database of Turkish-translated Islamic texts and prompt

engineering to semantically search for relevant information and provide it to the LLM.

Nonetheless, the authors did not report any evaluation criteria, metrics, or scores in their

research to reflect the system's performance, although they mentioned that the system's

effectiveness is limited when faced with larger contexts. Integrating and summarizing a 

large number of context chunks can disrupt coherence and the flow of information, 

resulting in fragmented responses. This challenge suggests a need for improvements in 

connecting and synthesizing retrieved data to produce cohesive outputs. 

Existing research in QA systems demonstrates significant advancements but also reveals 

notable limitations. MufassirQAS [110] struggles with larger contexts, leading

to fragmented responses and limited applicability to broader questions. Similarly, the 

Arabic QA system leveraging deep learning techniques [95] shows strong performance 

but suffers from scalability issues, inefficiencies in handling complex Arabic morphology, 

and a dependency on extensive training and fine-tuning processes. Lastly, there is a 

noticeable lack of generic QA systems that can seamlessly adapt to new knowledge bases 

without requiring significant reconfiguration or retraining, highlighting the limited 

flexibility and scalability of current approaches. 

1.3 Problem Statement 

LLMs like OpenAI's ChatGPT and Meta's Llama are often pre-trained on vast datasets of 

internet text enabling the models to learn world information, grammatical phrasing, 

vocabulary, and lingual context [111]. While these LLMs impress with their natural 

language abilities [9], they encounter distinct challenges regarding query understanding

or response generation when applied in some specialized domains. The challenges are

caused by the broad dataset they were trained on, conflicting facts in the dataset, or

outdated information, among other factors [24]. These challenges are amplified

when the LLM is dealing with complex languages like Arabic [112].


The phenomenon of generating nonsensical, inaccurate, or factually-incorrect responses 

is known as "hallucination" [24]. In question answering for sensitive

domains such as legal counseling or Islamic fatwa, such inaccuracies are not tolerated. In

Islamic fatwa for instance, where interpretations of Qur’an and Hadith verses are crucial, 

even minor errors are unacceptable. Additionally, user queries may involve culturally-

specific and geographically-dependent matters. For instance, one may ask about the 

permissibility of using a specific local bank or investing in a regional company. Such 

inquiries require a deep understanding of local culture, including events and institutions. 

While current LLMs trained on localized data show some cultural awareness, they often 

lack in-depth comprehension, hindering accuracy in specialized topics [113]. Moreover, 

LLMs like ChatGPT and Gemini often use content filtering to avoid sensitive, ethical, or 

potentially harmful topics [114]. While intended for responsible use, these filters can limit 

the scope and effectiveness of LLMs for specialized applications like Fatwa inquiries, 

where culturally specific responses are crucial. 

Fine-tuning LLMs for specialized domains is the traditional solution. However, the

enormous parameter counts of these models require significant computational resources

for fine-tuning. Robust LLMs that can understand text consist of billions of

parameters; for example, OpenAI's GPT-3 has 175 billion parameters,

Google's PaLM has 540 billion parameters [7], and Meta's Llama 3.1 family has 405

billion parameters in its largest model [115]. Training the GPT-3 davinci model reportedly took over 3

years and cost about 4 million USD for its 175B parameters. OpenAI's GPT-4, on

the other hand, reportedly has trillions of parameters, took about 3 years and 6 months to train, and

cost about 90 million USD, even with advances in GPU chips, which have become more

computationally powerful [116].

Even in smaller models like Meta's Llama 3.1 8B, fine-tuning can be resource-intensive. 

Techniques such as PEFT, LoRA, and adapter modules reduce computational 

requirements for the fine-tuning process, yet the process can still be time-consuming on 

systems with limited resources and requires high-end hardware [117]. For instance, full

fine-tuning of Llama 3.1 8B typically requires approximately 16 GB of GPU memory, 

which is manageable on a high-end consumer GPU [115]. Implementing techniques like 

LoRA and adapter modules further reduces memory requirements, making fine-tuning 

feasible on GPUs with even lower memory capacities. However, these methods often 

introduce additional complexity due to the need for careful hyperparameter tuning, which 


can impact performance if not optimized [118]. Additionally, these approaches may 

struggle to generalize beyond the specific tasks or domains they were fine-tuned for, 

limiting their flexibility [117]. Parameter-efficient fine-tuning also typically retains the 

original base model’s limitations, such as vulnerability to hallucination in contexts outside 

the fine-tuned scope [113]. 
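
A back-of-the-envelope sketch shows why parameter count drives hardware requirements and why low-rank adapters help. It assumes 16-bit weights (2 bytes per parameter) and counts only the memory for the weights themselves; gradients, optimizer states, and activations needed during fine-tuning come on top of this. The 4096 x 4096 matrix and the rank of 8 are illustrative values, not figures from this study.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def lora_trainable_params(d_out, d_in, rank):
    """Parameters LoRA trains for one d_out x d_in matrix: B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


weights_8b = weight_memory_gb(8e9)        # nominal 8B-parameter model
weights_175b = weight_memory_gb(175e9)    # nominal 175B-parameter model

full_matrix = 4096 * 4096                 # parameters in one illustrative matrix
lora_matrix = lora_trainable_params(4096, 4096, rank=8)
trainable_fraction = lora_matrix / full_matrix
```

Under these assumptions the 8B model needs about 16 GB for its weights and the 175B model about 350 GB, while the adapter trains well under 1% of the parameters of the matrix it modifies.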

Another concern in fine-tuning is that the structure of parameterized LLMs makes it difficult

to predict performance before the fine-tuning process is completed [15]. More 

importantly, modifying learned data remains an ongoing research challenge [120]. This 

often requires complete re-tuning, as there is currently no established method for LLMs 

to "unlearn" outdated information effectively [121]. Additionally, parameterized-memory 

LLMs lack the inherent capability to provide references for their generated responses, 

which limits their reliability in producing verifiable outputs [15,122]. 

This study introduces a generic Arabic question-answering framework capable of 

answering questions across various domains and adapting to any textual dataset without 

requiring fine-tuning. The framework processes questions posed in Arabic and generates 

a response consisting of two key elements:

1. Answer: the generated answer to the user's question.  

2. Evidence (if applicable): Supporting references, which may include Qur'anic or 

Hadith verses in the context of Islamic fatwa, or relevant legislation for legal 

counseling. This component is provided only when the question requires evidence to 

support the generated answer.  
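
The two-part response described above could be represented by a small data structure such as the one below. The class name, field names, and rendering format are hypothetical and serve only to illustrate that the evidence component is optional; they are not the framework's actual interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QAResponse:
    """A generated answer plus optional supporting references."""
    answer: str
    evidence: List[str] = field(default_factory=list)

    def render(self):
        # Evidence is shown only when the question called for it.
        if not self.evidence:
            return self.answer
        references = "\n".join(f"- {item}" for item in self.evidence)
        return f"{self.answer}\nEvidence:\n{references}"


without_evidence = QAResponse(answer="<generated answer>")
with_evidence = QAResponse(
    answer="<generated answer>",
    evidence=["<reference 1>", "<reference 2>"],
)
```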

1.4 Aims of Study 

This study aims to introduce Nebras, a novel Arabic question-answering system designed 

exclusively for specialized domains. Nebras addresses the unique requirements of Arabic 

language applications while providing a scalable, efficient, and context-aware solution 

for generating accurate, factually correct answers. The system’s design reflects several 

contributions that distinguish it from traditional systems. 

One of the key contributions of Nebras is its adaptability. The system’s knowledge base 

can be expanded and managed dynamically. Administrators can add new textual datasets 

tailored to specific domains, map relevant fields from the dataset for indexing and 



retrieval, and modify the knowledge base by updating or deleting existing documents as 

needed. This design eliminates the need for costly and time-intensive model fine-tuning 

and allows Nebras to seamlessly adapt to various domains while maintaining an up-to-

date and relevant knowledge base. 

Nebras employs a hybrid retrieval approach that combines multiple techniques to ensure 

relevant and accurate information retrieval. Its implementation leverages the Agentic 

RAG framework where specialized agents collaborate to process tasks effectively. 

Nebras addresses the unique challenges of the Arabic language by prioritizing linguistic 

and contextual precision. This study demonstrates its potential to advance question-

answering for Arabic-speaking users in specialized domains by proposing a robust, 

scalable, and adaptable solution to meet diverse application needs. 

1.5 Hypotheses of Study 

The following hypotheses have been formulated to guide the research, evaluate the 

effectiveness of Nebras, and address the challenges identified in the problem statement. 

Nebras is a novel Agentic RAG-based system introduced in this study, designed for 

complex Arabic question answering across specialized domains. 

1.5.1 Accuracy Hypothesis 

Nebras is expected to outperform existing LLMs in providing accurate and contextually 

relevant answers to Arabic queries in specialized domains. Nebras enhances response 

reliability and reduces hallucinations (factually incorrect yet plausible-sounding answers) 

by dynamically integrating external textual knowledge bases. Nebras's performance will 

be compared to a baseline established by evaluating responses generated by several top-

performing models, using both automatic metrics and expert evaluations on domain-

specific datasets. 

1.5.2 Adaptability and Scalability Hypothesis 

Nebras's dynamic knowledge base management system allows seamless expansion and 

modification, which enables the system to incorporate new datasets for emerging domains

without fine-tuning. Nebras offers a scalable and cost-effective Arabic question-

answering solution deployable even in resource-constrained environments by leveraging 



pre-trained LLMs. This will be validated by analyzing Nebras’s computational efficiency 

(memory and processing requirements) and its scalability to large datasets. 

1.5.3 Language-Specific Performance Hypothesis 

Nebras is hypothesized to outperform existing systems in handling the linguistic 

complexities of Arabic (morphology, syntax, and dialectal diversity). Its design is 

expected to give it an advantage over general-purpose LLMs when processing complex 

Arabic queries. This will be evaluated using both lexical and semantic metrics, and 

contextual relevance ratings. 

  

Chapter Two 

Methods 

This chapter details the development and implementation of Nebras, explaining the

design choices, algorithms, and frameworks used to process Arabic queries across 

multiple domains. It covers the hybrid retrieval approach, Agentic RAG framework 

integration, use of pre-trained LLMs, and the system's dynamic knowledge capabilities. 

Subsequent sections explore implementation aspects, highlight its key components, and 

clarify the system’s operation. 

2.1 Data Collection 

This research focuses on two distinct domains: Islamic Fatwas and University 

Information Help Desk. Each domain presents unique challenges and requires specific 

data structuring. The Islamic Fatwa domain, due to its sensitive nature and the need for 

accurate, well-supported answers, demands careful handling. Fatwas require precise 

reasoning and the inclusion of correct and specific daleel (evidence) aligned with Islamic 

jurisprudence. This daleel, often from the Qur’an or Hadith, must be accurately 

referenced; the system must not generate or fabricate it. This makes the domain an ideal 

choice for testing the system’s ability to generate contextually accurate responses with 

necessary supporting information. Additionally, this domain allows for the evaluation of 

the system’s performance with question-answer data structures, as fatwa data inherently 

follows this format. 

Data for the Islamic Fatwa domain is sourced from two reputable websites: 

1. NNU Fatwa Website (https://fatwa.najah.edu/): Maintained by the Faculty of Shari'a 

at An-Najah National University, a respected institution in Islamic scholarship, this 

website contains over 1,500 fatwas. These address local issues and inquiries specific 

to Palestine, making the data highly relevant and credible. 

2. Islamweb.net (https://islamweb.net): Overseen by the Ministry of Endowments and 

Islamic Affairs in Qatar, Islamweb.net offers a vast repository of over 160,000 fatwas. 

Its credibility and reliability make it a valuable resource for this research. 


These carefully selected, credible sources ensure a robust evaluation

of the proposed system’s ability to handle diverse and complex Arabic queries, 

contributing to the development of Nebras, the proposed question-answering framework. 

2.1.1 Islamic Fatwa Dataset Collection 

To gather data from these sources, a custom scraping template is developed for 

Islamweb.net to capture specific fields essential for the system responses. These fields, 

as explained in Table 1 and highlighted in Appendix D - Figure D11, include categories, 

fatwa topic summaries, unique fatwa identifiers, dates, questions posed, and the detailed 

answers provided by muftis. Each entry is appended with a static "source" field with the 

value "islamweb.net" to facilitate data source tracking. 

Table 1

Islamweb Fatwa Fields Mapping

| # | Field Description | Field Name | Data Type |
|---|---|---|---|
| 1 | Fatwa categories can be extracted from fatwa breadcrumbs. | categories | array |
| 2 | Fatwa topic. A short summary of what this is about. | topic | string |
| 3 | Fatwa id. | fatwa_id | integer |
| 4 | Fatwa date in Gregorian and Hijri. | date | string |
| 5 | Fatwa question asked by a user. | question | string |
| 6 | Fatwa answer from muftis at islamweb. | answer | string |

 
The initial scraping of Islamweb.net resulted in a dataset of 164,310 fatwas before pre-

processing. For the NNU Fatwa website, similar scraping techniques were used to obtain 

1,500 locally relevant fatwas, enriching the dataset with inquiries specific to the 

Palestinian context. 

2.1.2 University Help-desk Dataset Collection 

The university help-desk dataset is collected exclusively from NNU and comprises

information specific to the university. In contrast to the fatwa dataset, it belongs to a

different knowledge domain, focusing on academic and institutional information 

represented in a document-based, semi-structured format rather than a question-answer 

structure. This dataset includes factual information about majors, courses, admission 

requirements, and faculty members, allowing the evaluation of the ability of the proposed 



system to retrieve and generate accurate responses from semi-structured documents. The 

process of collecting data for the NNU dataset involves parsing of web pages from the 

university's official site (https://najah.edu), with additional details on the scraping 

techniques and categorization to be explained in the subsequent subsections. 

However, the NNU website’s protection measures prevent automated data scraping, 

making it difficult to automatically gather information across all academic programs. 

Consequently, a manual data collection process is implemented for a selected group of 

medical and IT-related majors. Due to the time-intensive nature of manually gathering 

data for over 200 majors, this selective approach is considered necessary for the research. 

Academic Data Collection 

The process of collecting academic data focused on obtaining detailed information 

regarding the academic majors available at An-Najah National University at both the 

undergraduate and graduate levels. This data includes details for each major, covering its 

title, affiliated faculty, academic degree (e.g., Bachelor's, Master's, Doctorate), duration 

of study, and a corresponding URL. The specific data fields extracted during the 

collection process are illustrated in Table 2  and highlighted in Appendix D - Figure D12. 

Table 2

NNU Academic Majors Fields Mapping

| # | Field Description | Field Name | Data Type |
|---|---|---|---|
| 1 | Academic major title | major_title | string |
| 2 | Major's faculty | faculty | string |
| 3 | Academic degree | degree | Enum (bachelors, master, doctorate) |
| 4 | Major's study duration | duration | string |
| 5 | Major's info URL | url | url |
| 6 | Document type | doc_type | string |

 
The description of each academic major is displayed on its own dedicated page, 

necessitating a visit to the page for data extraction. For selected majors in medical and 

IT-related fields, as a sample, these academic major pages were manually visited to 

retrieve their descriptions. While visiting each academic major's page, information related 

to the major curriculum and courses is collected. Table 3 shows information about the 

collected fields. The fields are also illustrated in Appendix D - Figure D13 for further 

reference. The "doc_type" field holds the document's category and is set manually while

collecting the data. The values used to distinguish the documents are "academic_major",

"academic_course", "staff", and "admission" for majors, courses, staff, and admission

documents, respectively.

https://www.najah.edu/ar/academic/undergraduate-programs/
https://www.najah.edu/ar/academic/graduate-programs/

Table 3

NNU Academic Courses Fields Mapping

| # | Field Description | Field Name | Data Type |
|---|---|---|---|
| 1 | Plan version | version | string |
| 2 | Curriculum section | section | string |
| 3 | Course number | course_number | integer |
| 4 | Course title | title | string |
| 5 | Credit hours | credit_hours | string |
| 6 | Course prerequisites | prerequisites | string |
| 7 | Course description | description | string |
| 8 | Document type | doc_type | Enum (academic_major, academic_course, staff, admission) |

 
Administration-related Data Collection 

The collection of administrative data is limited by the same protection measures

mentioned earlier, so information regarding the university's governance and faculty deans

is obtained manually. Some information regarding admission and acceptance is found on

Nawarat An-najah (https://nawarat.najah.edu/), an NNU subdomain for newly

registered students.

By preparing these two datasets, the study can assess the QA system's performance in 

generating answers in different domains with different dataset structures.

2.2 Data Pre-processing and Structuring 

Effective data structuring is essential for compatibility with the proposed system. Due to 

the distinct nature of the Islamic Fatwa and NNU datasets, each requires specific 

processing steps for organization and standardization. The following subsections detail 

the structuring methodologies for each dataset. 



2.2.1 Islamic Fatwa Dataset Pre-processing and Structuring 

This section details the cleaning and preprocessing steps applied to the Islamic fatwa data. 

These steps removed noise, inconsistencies, and irrelevant information to enhance the 

dataset's quality. The following subsections provide a comprehensive overview. 

Missing Values and Exploratory Data Analysis 

Data completeness is assessed by checking for missing values (NaN, whitespace-only 

values, or zeros in integer columns). The only column with missing values was the "topic" 

column, with 12 missing values. To mitigate the impact of missing data on

subsequent analysis, these missing values were replaced with human-generated 

descriptions derived from a careful analysis of the corresponding fatwa's context. To gain 

insights into the textual content of the dataset, an exploratory data analysis (EDA) is 

conducted on the "topic", "question", and "answer" fields. The text in these fields is 

tokenized by splitting it into individual words using a single-space delimiter, allowing for

an analysis of token counts. Table 4 presents the maximum, minimum, and mean token 

counts for each field. 
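
The completeness check and the token statistics described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the helper names `is_missing` and `token_stats` and the two toy records are hypothetical, and empty strings produced by consecutive spaces are dropped before counting.

```python
import math
from statistics import mean

def is_missing(value):
    """Treat None/NaN, whitespace-only strings, and integer zeros as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, int):
        return value == 0
    return False

def token_stats(texts):
    """Split each text on single spaces and summarize the token counts."""
    # Empty strings (from repeated spaces) are not counted as tokens.
    counts = [len([tok for tok in text.split(" ") if tok]) for text in texts]
    return {"max": max(counts), "min": min(counts), "mean": round(mean(counts), 2)}

# Toy records (hypothetical); the second one has a whitespace-only topic.
records = [
    {"topic": "حكم الربا", "question": "ما حكم الربا؟"},
    {"topic": "   ", "question": "ما هي شروط القبول في الجامعة؟"},
]
missing_topics = sum(is_missing(r["topic"]) for r in records)
question_stats = token_stats([r["question"] for r in records])
```

Applied per field, a helper of this kind would yield the maximum, minimum, and mean values of the sort reported in Table 4.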

Table 4

Token Counts from the Scraped Islamweb Dataset

| Field | Max. Count | Min. Count | Mean |
|---|---|---|---|
| topic | 50 | 1 | 7.68 |
| question | 2,191 | 2 | 78.39 |
| answer | 6,076 | 2 | 211.44 |

 
The "question" and "answer" columns exhibited a minimum token count of 2, which 

raised concerns about the potential presence of very short or incomplete entries. Further 

inspection of these low-token-count entries was postponed until after the text cleaning 

process. 

Muftis often begin fatwas with introductory sentences that do not contribute directly to 

the factual content or reasoning of the answer. The length of these messages can influence 

the chunking process and potentially impact the retrieval process. 

To extract introductory sentences, the line breaks in the scraped fatwas are used. These 

sentences are generally found in the first line, although this pattern is not consistent across 



all fatwas. By examining the first lines, the frequent introductory sentences are extracted 

through the following steps: 

− Splitting each fatwa by line breaks and the HTML <br> tag. 

− Extracting the first non-empty text result from the split array, excluding empty HTML 

tags or whitespace. 

− Calculating the frequency of each extracted sentence across the dataset. 
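
The three steps above can be sketched as follows. The function names, the regular expressions, and the sample answers are assumptions made for illustration; the study's actual extraction code is not shown in the text.

```python
import re
from collections import Counter

# Split on HTML <br> tags (any casing, optional slash) and on line breaks.
LINE_SPLIT = re.compile(r"<br\s*/?>|\r?\n", flags=re.IGNORECASE)
# Any remaining HTML tag, so that empty tags do not count as text.
TAG = re.compile(r"<[^>]+>")

def first_sentence(answer_html):
    """Return the first non-empty text line of a fatwa answer, or None."""
    for part in LINE_SPLIT.split(answer_html):
        text = TAG.sub("", part).strip()
        if text:
            return text
    return None

def intro_frequencies(answers):
    """Count how often each first line occurs across the dataset."""
    firsts = (first_sentence(a) for a in answers)
    return Counter(s for s in firsts if s is not None)

# Hypothetical sample answers.
answers = [
    "<p></p><br>الحمد لله أما بعد<br>فإن الربا محرم",
    "الحمد لله أما بعد\nفلا حرج في ذلك",
    "<br/>  \nخلاصة الفتوى<br>يجوز ذلك",
]
freq = intro_frequencies(answers)
```

Sorting the resulting counter by frequency (for example with `freq.most_common()`) gives the ranking of candidate introductory sentences.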

This process identified a total of 17,034 unique introductory sentences. The sentence

“الحمد لله والصلاة والسلام على رسول الله وعلى آله وصحبه أما بعد” ("Praise be to Allah, and blessings and peace be upon the Messenger of Allah and upon his family and companions. To proceed") appears most frequently in the

entire dataset, with a total of 138,281 occurrences (84%). The second most frequent

introductory sentence is “الحمد لله والصلاة والسلام على نبينا محمد وعلى آله وصحبه ومن والاه أما بعد” (the same formula, naming "our Prophet Muhammad" and adding "and those who follow him"),

which appears in 4,241 fatwas and has the same leading and trailing words.

Text Cleaning 

The text cleaning process involved several steps: stripping HTML tags, removing non-

Unicode characters, and eliminating duplicate tokens. Line breaks were preserved at this 

stage, as they might be beneficial in later processing steps. 

Removing Introductory Sentences 

Introductory sentences in fatwas do not contribute directly to the factual content or 

reasoning of the answer and usually follow a structural pattern. To remove them without 

affecting the context or valuable information, the process involves: extracting starting and 

trailing words from each introductory sentence to construct a pattern, identifying the first 

sentence in each fatwa, and removing it if it matches the pattern. Further analysis of these 

words, as detailed in Table 5, reveals that “الحمد” is the most frequent starting word in 

introductory sentences (98.5%). The trailing words for sentences beginning with "الحمد" 

are also examined, with results presented in Table 5. 
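
A minimal sketch of this pattern-based removal is given below. The trailing words are those listed in Table 5 for sentences starting with "الحمد"; the exact regular expression, the function name `remove_intro`, and the sample answers are assumptions rather than the study's implementation.

```python
import re

# Trailing words observed for sentences starting with "الحمد" (Table 5).
TRAILING_WORDS = ["بعد", "بعـد", "وبعد", "أعلم", "الفتوى"]

# Assumed pattern: the line starts with "الحمد" and ends with one of the
# trailing words, optionally followed by a punctuation mark.
INTRO_PATTERN = re.compile(
    r"^\s*الحمد\b.*\b(?:" + "|".join(TRAILING_WORDS) + r")\s*[.:،]?\s*$"
)

def remove_intro(answer):
    """Drop the first non-empty line if it matches the introductory pattern."""
    lines = answer.split("\n")
    for i, line in enumerate(lines):
        if line.strip():
            if INTRO_PATTERN.match(line):
                del lines[i]
            break  # only the first non-empty line is ever considered
    return "\n".join(lines).strip()

with_intro = "الحمد لله والصلاة والسلام على رسول الله وعلى آله وصحبه أما بعد:\nفإن الربا محرم."
without_intro = "الحمد لله الذي أحل البيع وحرم الربا."
cleaned = remove_intro(with_intro)
unchanged = remove_intro(without_intro)
```

Because the line must match both the starting word and a trailing word, a first sentence that merely begins with "الحمد" but carries substantive content is left untouched.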

 

Table 5

Frequency of Start and Trailing Words in Introductory Sentences in Fatwa

| Word | Frequency | Percentage |
|---|---|---|
| Starting Words | | |
| الحمد | 160,870 | 98.5% |
| خلاصة | 2,262 | 1.39% |
| الخلاصة | 89 | 0.054% |
| Trailing Words for Sentences Starting with “الحمد” | | |
| بعد | 84,580 | 51.8% |
| بعـد | 58,699 | 35.9% |
| أعلم | 13,814 | 8.45% |
| الفتوى | 2,268 | 1.39% |
| وبعد | 580 | 0.36% |

 
After removing the introductory sentences, the frequency of the starting words is 

recalculated and displayed in Table 6. 

Table 6

Frequency of First Words After Removing Introductory Sentences in Fatwa

| Word | Frequency | Percentage |
|---|---|---|
| فإن | 27,234 | 16.67% |
| الحمد | 17,403 | 10.65% |
| فقد | 14,680 | 8.99% |
| فلا | 13,837 | 8.47% |

 
Although the word "الحمد" still appears relatively frequently, in these cases it is likely

part of the fatwa answer itself, so eliminating it completely could compromise the

factual context within the fatwa.

Subsequent Text Cleaning and Re-evaluation 

Following the removal of introductory messages, the dataset is re-evaluated for missing 

values and token counts to assess the impact of text cleaning steps on token counts. The 



evaluation showed no missing values in any dataset columns. The results for token counts 

are presented in Table 7. 

Table 7

Token Counts from the Scraped Islamweb Dataset After Cleaning

| Field | Max. Count Before Cleaning | Max. Count After Cleaning | Min. Count | Reduction |
|---|---|---|---|---|
| topic | 50 | 50 | 1 | 0 |
| question | 2,191 | 2,119 | 1 | 3.29% |
| answer | 6,076 | 5,436 | 3 | 10.53% |
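
The "Reduction" column appears consistent with the relative decrease in the maximum token count, reading the maximum question count before cleaning as 2,191 tokens. The short check below reproduces the reported percentages under that assumption; the function name is hypothetical.

```python
def reduction_percent(before, after):
    """Relative decrease of the maximum token count, in percent."""
    return round((before - after) / before * 100, 2)

question_reduction = reduction_percent(2191, 2119)  # maximum question length
answer_reduction = reduction_percent(6076, 5436)    # maximum answer length
topic_reduction = reduction_percent(50, 50)         # unchanged maximum
```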

 
A manual review of questions comprising only a single word revealed a typographical

error in one query. The original, malformed question was:

قال.الله.ألاتزر.وازرة.وزر.أخرى.فكيف.يتفق.ذلك.مع.الشفاعة.ومع.إهداء.ثواب.الأعمال.إلى.الأموات؟

(Roughly: "Allah said that no bearer of burdens shall bear the burden of another, so how does this agree with intercession and with gifting the reward of deeds to the dead?")

Given the context and grammatical structure, it is clear that this query represents a user

input error: periods were typed in place of spaces, so the whole question was counted as a

single token. The question was subsequently corrected by replacing the periods with spaces.

For answers with fewer than 10 words, it was observed that some referenced other fatwas 

using phrases like:  

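− سبق برقم: 1640، ورقم: 10794. ("Previously addressed under no. 1640 and no. 10794.")

− فتفصيل ذلك سبق في الفتوى رقم: 32689 فتراجع. ("The details were given earlier in fatwa no. 32689, so refer to it.")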

This suggests that extracting fatwa references from answer texts could be valuable. 

Regular expressions can be employed to identify these reference numbers. 

The references 1640, 10794, and 32689, mentioned in the short fatwa answers, were not 

found within the dataset and are subsequently removed. 
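
One way to identify such reference numbers with a regular expression, and to check them against the fatwa IDs present in the dataset, is sketched below. The pattern, the function names, and the toy set of known IDs are assumptions for illustration only.

```python
import re

# Matches "رقم" (also inside "برقم" / "ورقم"), an optional colon, then digits.
REFERENCE_PATTERN = re.compile(r"رقم\s*:?\s*(\d+)")

def extract_references(answer):
    """Return the fatwa IDs cited in an answer, in order of appearance."""
    return [int(num) for num in REFERENCE_PATTERN.findall(answer)]

def split_references(answer, known_ids):
    """Separate cited IDs into those present in the dataset and those missing."""
    refs = extract_references(answer)
    found = [r for r in refs if r in known_ids]
    missing = [r for r in refs if r not in known_ids]
    return found, missing

refs = extract_references("سبق برقم: 1640، ورقم: 10794.")
found, missing = split_references(
    "فتفصيل ذلك سبق في الفتوى رقم: 32689 فتراجع.",
    known_ids={101, 202},  # hypothetical IDs standing in for the dataset
)
```

References that end up in the `missing` list correspond to the case described above, where the cited fatwas are not part of the scraped dataset.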

HTML Tag Removal and Text Formatting 

An analysis of randomly selected fatwas revealed that HTML tags were primarily used 

for text formatting and coloring, without a standardized structure for different fatwa 

components such as daleel (evidence) or references. This inconsistency made it 

impractical to leverage HTML formatting for information extraction. 



Category Extraction 

The final step in text pre-processing involved categorizing the fatwas to provide a 

structured taxonomy for subsequent analysis and retrieval. The category names were 

extracted from the fatwa breadcrumbs (Field 1 in the custom scraper template), and the word

“الرئيسية” (main) is removed to streamline the categorization process.

The first two words of the category names are retained to represent the main and 

secondary categories, respectively. This hierarchical categorization approach provides a 

more granular understanding of the fatwas' topics. The distribution of fatwas across the 

identified categories is visualized in Appendix E - Figure E14. 
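
The category extraction can be illustrated with the sketch below. It assumes that the "categories" field holds an ordered list of breadcrumb labels (Table 1 lists it as an array) and that the first two remaining labels serve as the main and secondary categories; the function name and the sample breadcrumb are hypothetical.

```python
ROOT_LABEL = "الرئيسية"  # "main", the breadcrumb root that is discarded

def split_categories(breadcrumbs):
    """Return (main, secondary) categories from a list of breadcrumb labels."""
    labels = [b.strip() for b in breadcrumbs if b.strip() and b.strip() != ROOT_LABEL]
    main = labels[0] if labels else None
    secondary = labels[1] if len(labels) > 1 else None
    return main, secondary

# Hypothetical breadcrumb trail for a fatwa about usury.
main, secondary = split_categories(["الرئيسية", "فقه المعاملات", "الربا", "ربا القرض"])
```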

2.2.2 NNU Dataset Pre-processing and Structuring 

The following subsections outline the structuring process for each NNU document type, 

including academic majors, academic courses, staff, and admissions.

Academic Majors Document Structuring 

For academic major documents, additional data from other scraped fields is merged into 

the “description” field (as shown in Table 2) to create an entry suitable for indexing. These

fields include “major_title”, “faculty”, “degree”, and “duration”. Enriching the

“description” with relevant information from these fields makes it more self-contained and

informative. The resulting enriched text is stored in a newly created field, "content," 

specifically designed for indexing and similarity searches. The template for the "content" 

field is provided in Appendix F. 

Staff and Admission Document Structuring 

No augmentation is applied to these documents; their content is indexed directly in the 

vector database. Staff documents consist solely of staff member biographies, while 

admission documents outline the admission rules for new student enrollment. 

Generating “topic” for NNU Documents 

A "topic" field is added to each collected document, providing a brief summary about the 

document’s content. This summary is generated using an LLM with the following prompt:

“حلل المستند التالي واستخرج وصفًا قصيرًا للموضوع الذي يلخص الفكرة أو التركيز الرئيسي للمستند.

يجب أن يكون الوصف موجزًا، ويفضل أن يتراوح بين 20-50 كلمة، ويعكس الموضوع الأساسي للنص.

إليك المستند: {text}”

Which translates to: “Analyze the following document and extract a brief description of 

the topic that summarizes the main idea or focus of the document. The description should 

be concise, ideally between 20-50 words, and should reflect the core subject of the text. 

Here is the document: {text}” 

The “topic” field is also indexed in the vector database, resulting in each document having 

two embedding vectors: one for the content field and another for the “topic” field. 

The data structuring process standardizes and optimizes both the Islamic fatwa and NNU 

datasets for seamless integration with the QA system. By tailoring the structuring methods 

to the unique requirements of each dataset, this process ensures that the system can 

effectively handle both question-answer data and document-based datasets. The

structured datasets, with enriched content fields, facilitate accurate and efficient similarity 

searches and response generation, setting a solid foundation for the subsequent stages of 

system experimentation and evaluation. 

2.3 Implementation 

The following subsections delve into knowledge base preparation and the answer 

generation pipeline. 

2.3.1 Vector Database 

To ensure adaptability across various textual datasets, a mapping component is proposed 

for dataset preparation in the QA pipeline. Textual datasets serving as knowledge bases 

for question answering can be categorized into: 

1. QA-based: Datasets comprising paired questions and answers.  

2. Document-based: Datasets containing documents on specific topics (e.g., PDFs).  

For QA-based datasets, essential fields include: 

− Document ID: Unique identifier for each document. 

− Question: The posed query. 

− Answer: The corresponding correct answer. 

− Topic: A concise description of the question's subject.  



The “question”, “answer”, and “topic” fields are vectorized and indexed in the vector 

database, while additional fields serve as metadata. For document-based datasets, 

required fields are: 

− Document ID: Unique identifier for each document. 

− Content: Full document content. 

− Topic: A brief summary of the document's content. 

The “content” and “topic” fields are vectorized and indexed while any extra fields act as 

metadata. 
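
A minimal sketch of such a mapping component is shown below. The dictionaries of required and vectorized fields follow the lists above; the function name `map_record`, the shape of the returned record, and the sample fatwa are assumptions and do not describe the actual Nebras implementation.

```python
REQUIRED_FIELDS = {
    "qa": ["document_id", "question", "answer", "topic"],
    "docs": ["document_id", "content", "topic"],
}
VECTORIZED_FIELDS = {
    "qa": ["question", "answer", "topic"],
    "docs": ["content", "topic"],
}

def map_record(record, field_map, dataset_type):
    """Rename source fields to system fields; unmapped fields become metadata."""
    mapped = {target: record[source] for target, source in field_map.items()}
    missing = [f for f in REQUIRED_FIELDS[dataset_type] if f not in mapped]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    used_sources = set(field_map.values())
    metadata = {k: v for k, v in record.items() if k not in used_sources}
    to_embed = {f: mapped[f] for f in VECTORIZED_FIELDS[dataset_type]}
    return {"id": mapped["document_id"], "embed": to_embed, "metadata": metadata}

# Hypothetical fatwa record using the scraped field names from Table 1.
fatwa = {"fatwa_id": 7, "question": "ما حكم الربا؟", "answer": "الربا محرم.",
         "topic": "حكم الربا", "source": "islamweb.net"}
field_map = {"document_id": "fatwa_id", "question": "question",
             "answer": "answer", "topic": "topic"}
doc = map_record(fatwa, field_map, "qa")
```

Here the "source" field is not mapped to any system field, so it is carried along as metadata rather than embedded.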

Regarding the vector database, each collection must have the following

metadata fields: 

− Title: A human-readable collection name. 

− Description: Summary of the collection’s content. 

− Type: Dataset type ("qa" or "docs"). 

− ID Field: Field name for uniquely identifying documents. 

A collection’s metadata example: 

{
  "name": "nnu",
  "title": "An-Najah National University",
  "description": "This collection includes comprehensive documents about An-Najah National University, covering admission processes, academic programs, and general university details.",
  "type": "docs",
  "id_field": "document_id"
}

By providing the necessary metadata, the system can efficiently manage and integrate 

new collections into its knowledge base. Users need only map relevant fields and provide 

collection metadata to incorporate datasets seamlessly as illustrated in Figure 4. 
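
Before a collection is registered, its metadata could be validated along the following lines. This is a sketch only: the function name is hypothetical, and treating "name" as required is an assumption based on the example above rather than on the listed metadata fields.

```python
import json

REQUIRED_KEYS = ("name", "title", "description", "type", "id_field")
ALLOWED_TYPES = ("qa", "docs")

def validate_collection_metadata(raw_json):
    """Parse collection metadata and check required keys and dataset type."""
    meta = json.loads(raw_json)
    missing = [k for k in REQUIRED_KEYS if not str(meta.get(k, "")).strip()]
    if missing:
        raise ValueError(f"missing metadata keys: {missing}")
    if meta["type"] not in ALLOWED_TYPES:
        raise ValueError(f"unknown collection type: {meta['type']}")
    return meta

nnu = validate_collection_metadata("""{
  "name": "nnu",
  "title": "An-Najah National University",
  "description": "Documents about admission, academic programs, and general details.",
  "type": "docs",
  "id_field": "document_id"
}""")
```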



Figure 4  

Dataset Field Mapper 

 
2.3.2 Indexing and Chunking 

During the indexing phase, documents are processed, segmented, and converted into 

embeddings, which are subsequently stored in a vector database. The quality of the index 

construction is critical in determining whether the appropriate context can be accurately 

retrieved during the retrieval phase [19].  

The indexing structure introduced in this research utilizes a Hierarchical Index Structure 

with attached metadata. This hierarchical approach addresses the issue of context 

mismatch that arises when retrieved chunks are semantically incomplete.  

Determining an optimal chunk size is a delicate process that requires balancing competing

considerations. Chunks that are too long introduce noise to the embedding model and

require more processing. Additionally, if a chunk exceeds the model’s maximum input

length, it will be truncated, leading to loss of meaning. In contrast, chunks that are too 

short may prevent the embedding model from properly capturing the context. 

Incorporating a hierarchical index enhances retrieval and allows the model to reconstruct the

context [15,19]. 

The process involves the following steps:

− Assigning a unique identifier to each document (if not already provided).

− Splitting the documents into smaller, fixed-size chunks and mapping each chunk to its

originating document, thereby creating a parent-child relationship (hierarchy).

− Attaching metadata to documents to enhance the filtering process and enrich the

document’s content.
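
The indexing steps can be sketched as follows. Measuring chunk size in whitespace-separated words, the default of 200, and the record layout are assumptions made for illustration; the embedding step and the vector database itself are omitted.

```python
import uuid

def build_hierarchical_index(documents, chunk_size=200):
    """Split documents into fixed-size word chunks linked to their parent."""
    parents, chunks = {}, []
    for doc in documents:
        # Assign an identifier only when the document does not provide one.
        doc_id = doc.get("id") or str(uuid.uuid4())
        parents[doc_id] = {"content": doc["content"],
                           "metadata": doc.get("metadata", {})}
        words = doc["content"].split()
        for start in range(0, len(words), chunk_size):
            chunks.append({
                "parent_id": doc_id,              # child -> parent link
                "position": start // chunk_size,  # order within the parent
                "text": " ".join(words[start:start + chunk_size]),
            })
    return parents, chunks

def parent_of(chunk, parents):
    """Return the full parent document (content and metadata) of a chunk."""
    return parents[chunk["parent_id"]]

docs = [{"id": "d1", "content": "a b c d e", "metadata": {"source": "x"}}]
parents, chunks = build_hierarchical_index(docs, chunk_size=2)
```

A similarity search would match against the chunk texts, while `parent_of` returns the complete originating document so that the context is not semantically incomplete.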

Achieving an appropriate chunk size involves a series of tests across different models and 

chunk sizes to find an effective balance, as illustrated in the “Chunking Optimization”

section.

2.3.3 QA Pipeline 

Traditional RAG frameworks operate in two main steps: retrieving relevant documents 

and generating responses based on these documents [15]. While effective for 

straightforward queries, standard RAG often lacks the capacity for complex reasoning or 

task decomposition across multiple nodes [36]. 

Nebras’s QA pipeline is implemented using an Agentic Retrieval-Augmented Generation 

approach leveraging a Graph-workflow framework for structuring nodes within the 

pipeline, where each node (or agent) has a very specific task. The framework organizes 

the workflow by enabling different agents and models at each step, handling discrete tasks 

rather than dynamically retrieving graph-based data. The pipeline workflow is illustrated 

in Figure 5. 

The pipeline utilizes a series of specialized agents to ensure the reliability and precision

of the responses. The process begins with the Query Decomposition Agent, which breaks

complex questions down into smaller, well-defined, and concise sub-questions. These are

then processed by the Query Classifier Agent to determine the type of question being

asked. Next, the Candidate Answer Generation Agent formulates potential answers, which

the Retriever Agent uses together with the sub-questions to identify the most relevant

documents. The Context Relevancy Agent assesses the retrieved information to ensure its

applicability to the query. Finally, the Answer Generation Agent constructs the final

response, and the Evidence Extraction Agent supplements the answer with supporting

evidence from the provided text (if any), ensuring the response is both accurate and

well-supported by authoritative references.
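
The flow of control through the agents can be illustrated with the sketch below, in which each node reads and extends a shared state. It does not use the graph-workflow framework itself: the node functions are stubs standing in for LLM and retriever calls, and all names and return values are hypothetical.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Node = Callable[[State], State]

def run_pipeline(nodes: List[Node], state: State) -> State:
    """Run nodes in order; each one reads and extends the shared state."""
    for node in nodes:
        state = node(state)
        state.setdefault("trace", []).append(node.__name__)
    return state

# Stub agents: each stands in for an LLM or retriever call.
def decompose(state):
    return {**state, "sub_queries": [state["query"]]}

def classify(state):
    return {**state, "collection": "fatwa"}

def candidate_answer(state):
    return {**state, "candidate": "hypothetical answer"}

def retrieve(state):
    return {**state, "documents": ["doc-1"]}

def filter_context(state):
    return {**state, "sentences": ["relevant sentence"]}

def generate_answer(state):
    return {**state, "answer": "final answer"}

def extract_evidence(state):
    return {**state, "evidence": []}

PIPELINE = [decompose, classify, candidate_answer, retrieve,
            filter_context, generate_answer, extract_evidence]
result = run_pipeline(PIPELINE, {"query": "ما حكم الربا؟"})
```

Keeping each agent as a separate node makes it possible to assign a different model to each step, which is the property the pipeline relies on.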



Figure 5  

QA Agentic RAG Pipeline 

 
The following sections delve into each agent within the QA pipeline, outlining their 

individual roles in the pipeline. 



Query Decomposition Agent 

The query decomposition agent aims to decompose complex user queries into more 

focused sub-queries. This process involves breaking down the original query into smaller 

parts, rephrasing them for clarity and conciseness, and correcting any spelling or 

grammatical errors. The agent is specifically designed to rephrase user queries into a 

format that aligns with the system's knowledge domains. For instance, a query such as:

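ما حكم الربا؟ كيف أقدم على تخصص علم الحاسوب؟ شحال الثمن ديال هاذ الكتاب؟ ما هي شروط القبول في الجامعة؟

("What is the ruling on usury? How do I apply to the computer science major? How much does this book cost? What are the admission requirements at the university?"; the third question is written in colloquial, dialectal Arabic)

would be decomposed into the following sub-queries:

− ما حكم الربا؟ (What is the ruling on usury?)

− كيف أقدم على تخصص علم الحاسوب في جامعة النجاح الوطنية؟ (How do I apply to the computer science major at An-Najah National University?)

− ما هو سعر هذا الكتاب؟ (What is the price of this book?)

− ما هي شروط القبول في جامعة النجاح الوطنية؟ (What are the admission requirements at An-Najah National University?)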

To support the system's dynamic design, the title field from the collections’

metadata is passed to the prompt so that queries are decomposed accordingly.

For example, if the system has two collections titled "Islamic Fatwa" and "An-Najah

National University", these knowledge domain topics are embedded in the prompt and

passed to the model. The model is prompted with the Arabic prompt referenced in

Appendix G.

This process ensures that the generated subqueries are semantically equivalent to the 

original query while being more suitable for subsequent agents. 
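
How the collection titles might be injected into the decomposition prompt is sketched below. The English wording of the template is a placeholder assumption; the actual prompt is written in Arabic and given in Appendix G, and the function name is hypothetical.

```python
def build_decomposition_prompt(query, collections):
    """Embed the collection titles (knowledge domains) into the prompt."""
    domains = "\n".join(f"- {c['title']}" for c in collections)
    # Placeholder wording; the study's actual prompt is in Arabic (Appendix G).
    return (
        "Split the user query into focused, self-contained sub-questions, "
        "fix spelling and grammar, and rephrase each one to fit these "
        f"knowledge domains:\n{domains}\n\nQuery: {query}"
    )

collections = [
    {"title": "Islamic Fatwa"},
    {"title": "An-Najah National University"},
]
prompt = build_decomposition_prompt("ما هي شروط القبول في الجامعة؟", collections)
```

Because the domain list is built from the collections' metadata at request time, adding a new collection changes the prompt without any change to the agent itself.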

Query Classifier Agent (Query routing) 

The classification agent classifies the decomposed queries based on their relevance to the 

knowledge base's specialized domains. This agent uses collections’ metadata (name, title, 

and description), which is retrieved, reformatted as a structured string, and included in 

the LLM prompt. The output is a JSON object pairing each decomposed question with its 

classified collection. The specific collection metadata fields and their transformation

into string templates for the classification agent's prompt are provided in

Appendix H.

Queries are classified as "irrelevant" if they don't align with available domains. The 

classification results, stored as a JSON object with "question" and "collection" keys, are 

easily accessible to other processes. The classification agent serves a dual purpose: it 

isolates irrelevant queries and determines the most appropriate collection for retrieving 

relevant documents. 
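
The two sides of this agent, formatting the metadata for the prompt and routing the classified queries, can be sketched as follows. The string layout, the assumption that the model returns a JSON array of objects with "question" and "collection" keys, and the collection name "fatwa" are illustrative; the actual templates are given in Appendix H.

```python
import json

def format_collections(collections):
    """Render collection metadata as a structured string for the prompt."""
    return "\n".join(
        f"name: {c['name']} | title: {c['title']} | description: {c['description']}"
        for c in collections
    )

def route(llm_output, collections):
    """Parse the classifier's JSON output and separate irrelevant queries."""
    known = {c["name"] for c in collections}
    routed, irrelevant = {}, []
    for item in json.loads(llm_output):
        if item["collection"] in known:
            routed.setdefault(item["collection"], []).append(item["question"])
        else:
            # Covers "irrelevant" as well as any unknown collection name.
            irrelevant.append(item["question"])
    return routed, irrelevant

collections = [
    {"name": "fatwa", "title": "Islamic Fatwa", "description": "Fatwa questions and answers."},
    {"name": "nnu", "title": "An-Najah National University", "description": "University documents."},
]
llm_output = json.dumps([
    {"question": "ما حكم الربا؟", "collection": "fatwa"},
    {"question": "ما هو سعر هذا الكتاب؟", "collection": "irrelevant"},
], ensure_ascii=False)
routed, irrelevant = route(llm_output, collections)
```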

Candidate Answer Generation Agent 

The Candidate Answer agent generates potential answers for relevant queries (those not 

filtered by the Query Classifier Agent). It uses the HyDE technique for query expansion 

by generating an initial answer from a large language model without context. This initial, 

potentially inaccurate response helps retrieve more relevant documents. To maximize 

accuracy and relevance, the LLM is instructed to provide clear, concise answers to Islamic 

fatwa questions, excluding Hadith or Qur’an verses due to their sensitive nature. The 

prompt for candidate answers generation is referenced in Appendix I. 

Retriever Agent 

In the retrieval process, the decomposed relevant queries and the generated candidate 

answers are employed to retrieve relevant documents. 

In this research, the proposed hybrid retrieval approach integrates four similarity search 

techniques, each employing a Hierarchical Index structure. These methods retrieve the 

parent document, along with its attached metadata. Recall that each document is

represented by multiple embedding vectors to capture different aspects of its content. The 

similarity search techniques are: 

1. query-to-answer ($\text{query} \xrightarrow{\text{sim}} \text{answer}$): similarity search between the query and the

indexed answer.

2. HyDE ($\text{query} + \text{candidate answer} \xrightarrow{\text{sim}} \text{answer}$): combining the user’s query with the

candidate answer to form a new query for similarity search against the indexed

answer.

3. query-to-question ($\text{query} \xrightarrow{\text{sim}} \text{question}$): similarity search between the query and the

indexed question.

4. query-to-topic ($\text{query} \xrightarrow{\text{sim}} \text{topic}$): similarity search between the query and the

indexed topic.



If the dataset is not question-answer based, the same techniques are used, but without 

employing the query-to-question approach. Also, instead of matching against the answer

field, the similarity search is performed on the “content” field.  
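
The choice of search text and indexed field for each technique can be summarized in a short sketch. The function name and the use of simple concatenation to combine the query with the candidate answer for HyDE are assumptions; the similarity search itself is omitted.

```python
def retrieval_plan(query, candidate_answer, dataset_type):
    """List (technique, search text, indexed field) triples for one query."""
    # QA datasets are matched against "answer", document datasets against "content".
    target = "answer" if dataset_type == "qa" else "content"
    plan = [
        ("query-to-answer", query, target),
        ("hyde", f"{query} {candidate_answer}", target),
        ("query-to-topic", query, "topic"),
    ]
    if dataset_type == "qa":
        # Only QA datasets have an indexed "question" field.
        plan.append(("query-to-question", query, "question"))
    return plan

qa_plan = retrieval_plan("ما حكم الربا؟", "الربا محرم", "qa")
docs_plan = retrieval_plan("ما هي شروط القبول؟", "يشترط معدل معين", "docs")
```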

Each retriever returns the top-k most similar documents based on its retrieval criteria. The 

results from all retrievers are then combined into a single collection and filtered. This 

filtering process is guided by a ranking score, which is calculated using a ranking model, 

and the document frequency within the combined collection. The relevance score for each 

document is computed as the sum of its ranking score and its normalized frequency score. 

The final selection involves choosing the top-k documents based on a predetermined 

similarity threshold applied to the relevance score. Figure 6 illustrates the retriever agent 

pipeline, with the final output consisting of an array of documents. These documents 

retain the same structure and fields as the original input documents, aligned according to 

the field mapping defined by the user. 
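The filtering step can be sketched as follows, under stated assumptions. The text does not specify how the frequency is normalized, so this sketch divides each document's count by the largest count in the combined collection; it also assumes that the ranking model's scores are already available as a dictionary and that a document is kept when its relevance score is at least the threshold. The function name `select_documents` and its parameters are hypothetical.

```python
from collections import Counter

def select_documents(retrieved_ids, ranking_scores, k=3, threshold=1.0):
    """Score pooled documents and keep the top-k above a threshold.

    retrieved_ids  : pooled ids from all retrievers, duplicates included
    ranking_scores : id -> score from a ranking model (assumed given)
    """
    if not retrieved_ids:
        return []
    counts = Counter(retrieved_ids)
    max_count = max(counts.values())
    # relevance = ranking score + normalized frequency
    # (normalization by the maximum count is an assumption of this sketch)
    relevance = {
        doc: ranking_scores[doc] + counts[doc] / max_count
        for doc in counts
    }
    kept = [(doc, score) for doc, score in relevance.items()
            if score >= threshold]
    # highest relevance first; ties broken by id for determinism
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept[:k]
```

For example, with pooled ids `["a", "b", "a", "c", "a", "b"]` and ranking scores of 0.9, 0.5, and 0.2 for `a`, `b`, and `c`, document `a` receives 0.9 + 3/3 and document `b` receives 0.5 + 2/3, while `c` (0.2 + 1/3) falls below a threshold of 1.0 and is dropped.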

Context Relevancy Agent 

The Retriever Agent often returns multiple documents 

related to the input query. These documents may include sentences that are not directly 

relevant to the answer, potentially influencing the model’s response. To address this, the 

retrieved documents are passed to a long-context language model to identify and extract 

the most relevant sentences. The prompt is referenced in Appendix J. 

The agent returns an array of the most relevant sentences related to the query, applying 

context compression to ensure the answer generation model focuses more effectively on 

the most relevant information. 

Answer Generation Agent 

The Answer Generation Agent receives the decomposed relevant query along with the relevant 

sentences identified by the Context Relevancy Agent. These inputs are integrated into a 

carefully crafted system prompt that defines strict guidelines for generating responses. 

The agent ensures that responses are clear, concise, and respectful, avoiding repetition of 

the query and maintaining a formal tone. When no context is available, it politely 

acknowledges its inability to provide an answer. This approach ensures that the agent 

delivers contextually relevant responses. The prompt for answer generation is provided 

in Appendix K. 


Evidence Extraction Agent 

The Evidence Extraction Agent focuses on extracting Islamic daleel or legal references 

for legal queries. A prompt written in Arabic is passed to the LLM to identify and extract 

relevant Quranic verses, hadiths, scholarly references, and legal references (such as 

article numbers) that support the generated response. The Arabic prompt used is 

given in Appendix L. 

In the QA pipeline, the workflow iteratively calls the agents from the Candidate Answer 

Generation Agent through to the Evidence Extraction Agent until all relevant decomposed 

queries have been addressed. 
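The iteration over decomposed queries can be pictured with the sketch below. It is illustrative only: the function `run_qa_pipeline`, the dictionary keys, and the agent call signatures are hypothetical, and each agent is assumed to be a plain callable wrapping the corresponding prompt. In particular, passing the generated answer and the relevant sentences to the evidence extraction step is an assumption of this sketch.

```python
def run_qa_pipeline(relevant_queries, agents):
    """Call the agent chain once per relevant decomposed query.

    agents: dict of callables (hypothetical interface) with the keys
    'candidate_answer', 'retriever', 'context_relevancy',
    'answer_generation' and 'evidence_extraction'.
    """
    results = []
    for query in relevant_queries:
        # 1. answer drafted without context, used only to aid retrieval
        candidate = agents["candidate_answer"](query)
        # 2. hybrid retrieval with the query and the candidate answer
        documents = agents["retriever"](query, candidate)
        # 3. context compression: keep only the relevant sentences
        sentences = agents["context_relevancy"](query, documents)
        # 4. grounded answer generated from the compressed context
        answer = agents["answer_generation"](query, sentences)
        # 5. supporting references for the generated answer
        evidence = agents["evidence_extraction"](answer, sentences)
        results.append(
            {"query": query, "answer": answer, "evidence": evidence}
        )
    return results
```

Keeping the agents behind a uniform callable interface makes it straightforward to replace any of them with a stub when testing the control flow in isolation.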

Irrelevant Queries Response Agent 

The Irrelevant Queries Response Agent is called if the user’s query is deemed irrelevant 

to the knowledge base. This agent informs the user about the system’s capabilities and 

provides guidance on formulating appropriate queries when the question concerns 

a topic beyond the system’s knowledge. By leveraging the collection’s title 

metadata stored in the vector store, the agent can effectively communicate the system’s 

areas of expertise. To ensure a respectful and informative interaction, the agent is 

designed to respond politely to inappropriate user input, such as profanity or harmful 

questions. While safeguard models such as Meta’s Llama Guard 3 were tested to detect 

harmful content, challenges were encountered in identifying profane words, particularly 

in Arabic. Additionally, certain questions related to Islamic fatwa, such as marital issues, 

were mistakenly clas