An-Najah National University
Faculty of Graduate Studies

NEBRAS: A RAG-BASED QUESTION ANSWERING SYSTEM FOR ISLAMIC AND LEGAL GUIDANCE

By
Samer Nitham Al-Huwari

Supervisor
Dr. Hamed Abdelhaq

This Thesis is Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Artificial Intelligence, Faculty of Graduate Studies, An-Najah National University, Nablus - Palestine.

2025

This Thesis was Defended Successfully on 27/02/2025 and approved by:
Dr. Hamed Abdelhaq, Supervisor
Prof. Mohammed Awad, External Examiner
Dr. Ajmad Hawwash, Internal Examiner

Dedication

To my incredible family, my parents and siblings, whose unwavering love, patience, and encouragement have been a constant source of strength throughout this journey. Your faith in me has been my greatest motivation, and your words have guided me forward.

To Dr. Hamed, my mentor and guide, who not only supervised this research but also opened doors to opportunities I never imagined. Your belief in my potential has shaped my academic journey profoundly.

Lastly, to myself, for demonstrating unwavering determination and confronting this journey with courage. This achievement stands as a testament to resilience, perseverance, and an enduring belief in the promise of brighter days ahead.

Acknowledgements

I would like to express my heartfelt gratitude to all those who contributed to the success of this research. My sincere thanks go to the Palestinian Dar Al-Ifta’a for their invaluable assistance in evaluating the fatwas generated by the system, providing critical insights that greatly enriched this work. I am also deeply thankful to An-Najah National University for providing a sample of Islamic fatwas related to the local Palestinian community, which served as a resource for implementing the Islamic Fatwa answer generation.
Without the support and collaboration of these esteemed institutions, this research would not have been possible.

Declaration

I, the undersigned, declare that I submitted the thesis entitled:

NEBRAS: A RAG-BASED QUESTION ANSWERING SYSTEM FOR ISLAMIC AND LEGAL GUIDANCE

I declare that the work provided in this thesis, unless otherwise referenced, is the researcher’s own work, and has not been submitted elsewhere for any other degree or qualification.

Student's Name: Samer Nitham Al-Huwari
Date: 27/02/2025

List of Contents

Dedication
Acknowledgements
Declaration
List of Contents
List of Tables
List of Figures
List of Appendices
Abstract
Chapter One: Introduction and Theoretical Background
1.1 Theoretical Background
1.1.1 Question Answering
1.1.3 Prompt Engineering
1.1.4 Retrieval Augmented Generation
1.1.5 Agents in Large Language Models
1.2 Literature Review
1.3 Problem Statement
1.4 Aims of Study
1.5 Hypotheses of Study
1.5.1 Accuracy Hypothesis
1.5.2 Adaptability and Scalability Hypothesis
1.5.3 Language-Specific Performance Hypothesis
Chapter Two: Methods
2.1 Data Collection
2.1.1 Islamic Fatwa Dataset Collection
2.1.2 University Help-desk Dataset Collection
2.2 Data Pre-processing and Structuring
2.2.1 Islamic Fatwa Dataset Pre-processing and Structuring
2.2.2 NNU Dataset Pre-processing and Structuring
2.3 Implementation
2.3.1 Vector Database
2.3.2 Indexing and Chunking
2.3.3 QA Pipeline
Chapter Three: Experimentation and Results
3.1 Indexing and Chunking
3.2 Implementation
3.3 Experiment with Islamic Fatwa Dataset
3.3.1 Baseline Evaluation: Responses Without Retrieval
3.3.2 Context Effectiveness on Fatwa Answer Generation
3.3.3 Employing RAG
3.4 Experiment with An-Najah National University Dataset
3.4.1 Baseline Evaluation: Responses Without Retrieval
3.4.2 Hybrid Retrieval Pipeline Evaluation with NNU Dataset
Chapter Four: Discussions and Conclusion
List of Abbreviations
References
Appendices
Abstract in Arabic (الملخص)

List of Tables

Table 1: Islamweb Fatwa Fields Mapping
Table 2: NNU Academic Majors Fields Mapping
Table 3: NNU Academic Courses Fields Mapping
Table 4: Token Counts from the Scraped Islamweb Dataset
Table 5: Frequency of Start and Trailing Words in Introductory Sentences in Fatwa
Table 6: Frequency of First Words After Removing Introductory Sentences in Fatwa
Table 7: Token Counts from the Scraped Islamweb Dataset After Cleaning
Table 8: Embedding Models Evaluation
Table 9: Summary of Implementation Technologies
Table 10: Palestinian Dar Al Ifta'a Evaluation of Generated Fatwas

List of Figures

Figure 1: Traditional RAG Pipeline
Figure 2: Two-Level Hierarchical Structure Indexing for One Document
Figure 3: Query Expansion Using HyDE
Figure 4: Dataset Field Mapper
Figure 5: QA Agentic RAG Pipeline
Figure 6: Hybrid Retrieval Pipeline
Figure 7: Ground Truth Evaluation With and Without Context
Figure 8: Comparing GT Automatic Metrics & Human Evaluation
Figure 9: Ground Truth Comparison Between Nebras and Baseline Models
Figure 10: NNU Dataset Ground Truth Scores Comparison

List of Appendices

Appendix A: Google Gemini Response to Accounting Major Required GPA
Appendix B: GPT-4o Response to Accounting Major Required GPA
Appendix C: LLM and Mufti Responses to Fatwa on Promoting Products
Appendix D: Dataset Scraped Fields Mapping
Appendix E: Fatwa Dataset Category Distribution
Appendix F: Field Augmentation for Structuring NNU Dataset
Appendix G: Query Decomposition Agent's Prompt
Appendix H: Query Classification Agent's Prompt
Appendix I: Candidate Answer Agent Prompt
Appendix J: Context Relevance Prompt
Appendix K: Answer Generation Agent Prompt
Appendix L: Evidence Extraction Agent Prompt
Appendix M: Irrelevant Query Response Agent Prompt
Appendix N: Islamic Fatwa with No Context (Baseline Evaluation)
Appendix O: Prompt for Generating Context-based Answers
Appendix P: PDR Evaluation
Appendix Q: Retrieval with Ranking Model
Appendix R: HyDE
Appendix S: Query to Question
Appendix T: Query to Topic
Appendix U: Hybrid Retriever
Appendix V: Ground Truth Evaluation for 100 Fatwas
Appendix W: NNU Baseline
Appendix X: NNU Hybrid Retrieval
Appendix Y: Experimentation Download Link

Abstract

Question answering (QA) systems are essential tools in natural language processing (NLP), designed to interpret user queries and generate relevant answers. These systems have evolved over time from rule-based models to advanced machine-learning-based approaches. The emergence of the Transformer architecture and Large Language Models (LLMs) has set the stage for modern QA systems. LLMs have transformed QA through their ability to understand complex linguistic patterns and, by leveraging vast datasets, to generate human-like responses across various domains.
However, LLMs often generate plausible but incorrect answers, particularly in specialized domains like law and religion where accuracy is critical. This phenomenon is known as “hallucination”. The risk of hallucination increases when dealing with a complex language like Arabic. The Arabic language, with its rich morphology, diverse dialects, and dependency on diacritics, presents significant challenges for LLMs primarily trained on Western languages. Fine-tuning LLMs for domain-specific tasks is time-intensive and computationally expensive given their massive parameter counts, which calls for innovative approaches that mitigate LLM hallucination without extensive re-training.

This thesis introduces Nebras, a generic multi-domain QA system leveraging a Retrieval-Augmented Generation (RAG) framework, LLM agents, and a hybrid retrieval approach. Nebras’s knowledge base can be dynamically extended by following simple guidelines and using its built-in mapping component, enabling it to adapt to any textual dataset. By employing an Agentic RAG pipeline, Nebras optimizes each processing stage using specialized agents. Furthermore, it utilizes pre-trained LLMs without fine-tuning, enhancing scalability and reducing computational costs.

Experimentation results demonstrated Nebras’s performance in Arabic domain-specific QA. In the Islamic fatwa domain, it achieved a BERTScore-F1 of 70.94% and a METEOR of 13.49%, with 9 accepted fatwas compared to only 7 accepted from GPT-4o. In the university help-desk domain, Nebras achieved a BERTScore-F1 of 75.80%, a METEOR of 40.20%, and a BLEU of 9%, significantly outperforming GPT-4o's BLEU score of 2.3%. These results highlight Nebras's ability to enhance factual accuracy, confirming its potential as a scalable Arabic QA solution.
Keywords: NEBRAS, Multi-domain Arabic question-answering, Agentic Retrieval-Augmented Generation, Large Language Models

Chapter One
Introduction and Theoretical Background

Research in language modeling began in the early 1950s with Shannon's work exploring the predictive and compressive capabilities of simple n-gram models on natural language [1]. Over time, statistical language modeling became a foundational technique in various natural language understanding and generation tasks, such as speech recognition, information retrieval, and question answering [2]. The introduction of neural network-based approaches represented a paradigm shift in language modeling, moving beyond traditional statistical methods [3]. Early deep learning approaches laid the groundwork for future progress by enabling the capture of more complex textual patterns and contextual dependencies [4]. This advancement led to the development of transformer architectures, which revolutionized the field by enabling the effective handling of long-range dependencies and parallel processing [5]. These advancements marked a new era of NLP and paved the way for the emergence of LLMs [6].

LLMs have significantly advanced NLP by demonstrating their ability to understand and generate human-like text [7]. The scale and architecture of LLMs, together with their training on extensive and diverse datasets, enable them to capture hidden and complex patterns in language [8]. These capabilities allow the models to achieve state-of-the-art performance across a variety of NLP tasks, including but not limited to text summarization, sentiment analysis, machine translation, and question answering [9].

LLM capabilities are often tested by the complexities of other languages like Arabic [10,11]. Unlike most Western languages, Arabic is characterized by its rich and complex morphology, its dependence on diacritics, and its multiple verb variations and noun forms [12]. This complexity makes accurate language processing a challenge.
Diacritical marks (harakat) are omitted in most standard written Arabic texts, such as newspapers, books, and other general writing. These marks indicate short vowels and certain grammatical features, and their absence can introduce ambiguity in word meaning and pronunciation [13]. This ambiguity arises because many Arabic words are written as consonant skeletons without diacritics, making the intended word dependent on context. For example, the root "كتب" (k-t-b) can mean "he wrote" (kataba) or "books" (kutub), depending on the diacritics and context. This ambiguity becomes even more noticeable when paired with the wide variety of Arabic dialects, which often differ significantly from Modern Standard Arabic in vocabulary, syntax, and pronunciation [11]. Combined with the lack of high-quality Arabic datasets, these factors lead to weaker performance and frequent misunderstandings in Arabic-language tasks compared to languages with richer datasets [14].

In the field of QA, LLMs typically rely on the knowledge acquired during training and stored within their parameterized memory to generate responses [15]. This reliance exposes them to knowledge gaps in specialized domains and to difficulties in processing diverse languages, increasing the risk of generating irrelevant or factually incorrect answers, a phenomenon commonly referred to as hallucination [16,17]. Moreover, the inherent complexity of natural language often results in models misinterpreting questions, leading to responses that are either irrelevant or incorrect [18]. For instance, consider the Arabic question:

ما هو معدل القبول لتخصص المحاسبة في جامعة النجاح الوطنية؟

(English translation: What is the required GPA for the Accounting major at An-Najah National University?)
Google's Gemini 1.5 Pro did not provide a direct answer about the minimum required GPA. Instead, it explained why academic majors do not have a fixed minimum required GPA at An-Najah National University, resulting in an irrelevant response. The generated response is detailed in Appendix A. When OpenAI’s GPT-4o was presented with the same question, it demonstrated a clear understanding of the query but incorrectly identified the minimum required GPA for admission as being between 80–85%, whereas in 2024 the actual minimum GPA for admission to the Accounting major at An-Najah National University (NNU) was 65%. The complete response from GPT-4o can be found in Appendix B.

These challenges grow in specialized domains in general, and particularly in sensitive areas like legal counseling and religion, which often require deep domain knowledge and reasoning, as well as access to information that may not have been available to the LLM during its training [19].

In the Islamic religious context, for example, Fatwas serve as a crucial guide for Muslims on legal, ethical, and religious matters. Fatwas are formal statements issued by qualified Islamic scholars (Muftis) after analyzing Quranic verses, Hadith narrations (teachings attributed to the Prophet Muhammad), and the established reasoning of past Islamic scholars and Imams to establish an answer for a specific case [20]. The significance of Fatwas lies in their ability to translate broad Islamic principles into practical guidance for individuals facing unique situations. They offer clarification on permitted actions, ethical dilemmas, and religious obligations in a constantly evolving world. Therefore, inaccurate or misleading Fatwas can have serious religious and legal repercussions for the user. Moreover, Islamic Fatwas can differ among scholars due to variations in jurisprudential schools of thought (madhahib), resulting in a diversity of perspectives on the same issue [21,22].
These factors highlight the importance of accurate and reliable resources when seeking guidance through Islamic Fatwas. Accurately understanding the context of Qur’anic verses and Hadith, and reasoning within Islamic text, is crucial for navigating the complexities of Islamic law [11]. Current LLMs might struggle with this depth of domain-specific knowledge. For example, Google's Gemini 1.5 Pro and OpenAI's GPT-4o were asked the following Arabic fatwa question:

حكم الترويج لمنتجات لشركة مقابل عمولة مع شرط مسبق على المرّوج بدفع مبلغ مالي

(English translation: What is the Islamic Fatwa ruling on promoting a company's products in exchange for a commission, with a prior condition requiring the promoter to pay a sum of money?)

Both the Gemini and GPT-4o responses presented multiple rulings for various scenarios, deeming it جائز (permissible) in some cases, whereas the Mufti's response explicitly declared it حرام (forbidden). The detailed responses from both Google's Gemini 1.5 Pro and OpenAI's GPT-4o, along with the Mufti's ground truth answer, are provided in Appendix C (C.1, C.2, and C.3).

In contrast, academic domains often involve factual queries requiring precision, clarity, and contextual understanding, such as university policies, course requirements, or research guidelines. These inquiries are typically based on carefully crafted institutional policies and thoroughly reviewed documentation [23,24]. Specialized academic domains pose unique challenges for LLMs due to the diversity, complexity, and constant evolution of educational systems and their policies. Research highlights that LLMs struggle with reasoning tasks requiring specialized knowledge, which necessitates efforts in domain-specific data acquisition and fine-tuning [25]. The dynamic nature of academic content and institutional guidelines further complicates matters, as static training data risks outdated or biased responses in such domains [26,27].
Fine-tuning is one approach to mitigating LLM hallucinations, either by expanding the model's knowledge or by enhancing its linguistic capabilities. However, updating a model's parameterized memory through fine-tuning is challenging, as there is no established method for directly overwriting existing knowledge. This can lead to knowledge conflicts, which remain an ongoing research area [28,29]. Expanding a model's language capabilities within a specific domain typically involves fine-tuning on data relevant to that domain. If the model lacks initial support for the target language, pre-training on a large corpus in that language is usually necessary before domain-specific fine-tuning [30]. Furthermore, given the large number of parameters in LLMs, fine-tuning demands significant computational resources, making it a time-consuming and expensive process. Advancements in parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA) or adapter modules, enable targeted updates without re-tuning the full model, which mitigates some of these challenges. These methods nevertheless require careful tuning of their hyperparameters, and performance can suffer if they are not optimized [35].

An alternative approach to mitigating hallucination in LLMs is RAG. RAG integrates a language model with an external retriever that fetches relevant documents from a knowledge base, enabling the model to generate accurate responses by referencing up-to-date information [19]. This method can be more efficient than fine-tuning because it avoids altering the model's internal parameters, which reduces computational costs and enables dynamic updates to the knowledge base [32]. However, RAG has its challenges, including ensuring retrieval accuracy, managing outdated or conflicting information, and maintaining retrieval latency at scale [33]. Additionally, applying RAG to languages like Arabic requires effective retrieval mechanisms for morphologically rich languages [34].
Recent advancements in RAG, such as the incorporation of intelligent agents, have improved retrieval and synthesis processes, addressing limitations like retrieval accuracy and latency while expanding the system's capabilities [35–37].

Agents in LLMs leverage their extensive reasoning capabilities to autonomously perform tasks by interpreting complex instructions and executing multi-step processes, acting as intermediaries between users and computational resources [38,39]. These agents demonstrate versatility across diverse domains such as social and natural sciences and engineering, but challenges like maintaining consistent performance, accurate contextual understanding, and seamless tool integration persist [38,40].

Traditional RAG systems lack decision-making and validation mechanisms. In contrast, Agentic RAG integrates intelligent agents to dynamically select, process, and validate relevant information, enhancing response accuracy, contextual relevance, and robustness compared to traditional RAG systems [15,35,36]. In addition, incorporating agents into the RAG approach, that is, implementing Agentic RAG, makes it possible to handle complex, multi-step reasoning tasks more effectively and to deliver contextually appropriate responses [37,41].

This research introduces Nebras, a novel multi-domain Arabic question-answering system designed for adaptability across various domains. Unlike traditional models that require domain-specific fine-tuning, Nebras processes textual datasets provided by users to expand its domain coverage. This capability allows the system to incorporate new datasets provided by administrators without the need for costly and time-consuming re-training processes. Nebras leverages a hybrid retrieval approach that combines multiple techniques to ensure relevant information retrieval from large-scale corpora. It employs an Agentic RAG pipeline that consists of multiple agents, each performing a specific task in the system’s proposed pipeline.
These agents work in coordination with pre-trained LLMs, enabling Nebras to deliver accurate, factually grounded, and context-aware responses. This design ensures that Nebras remains flexible, scalable, and capable of addressing diverse Arabic language applications across various sectors.

1.1 Theoretical Background

This section presents an overview of the theoretical concepts behind question-answering systems and recent advancements in the field.

1.1.1 Question Answering

QA is a core task within the field of NLP, which deals with enabling machines to understand and process human language. QA systems are designed to answer questions posed by users in natural language, aiming to provide relevant, accurate answers by analyzing the input query and generating appropriate responses [42,43].

Based on their reliance on different data structures, QA systems can be categorized into text-based and knowledge-based [44]. Knowledge-based QA utilizes structured knowledge bases (KBs) that store information as triples in the format (subject, predicate, object) [45]. These systems answer questions by querying single-relation or multi-relation facts in the KB, with single-relation questions relying on one fact and multi-relation questions requiring reasoning over multiple connected facts [46,47].

In contrast, text-based QA retrieves answers from unstructured text, such as documents or articles, by identifying and extracting the most relevant passages that match the query [45]. These systems typically follow a three-step pipeline [43]:
1. Question processing: involves query formulation and answer type detection using classifiers.
2. Document and passage retrieval: employs information retrieval (IR) models to extract relevant text segments.
3. Answer extraction: the system measures the similarity between the query and candidate answers to determine the most appropriate response.
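The three steps above can be illustrated with a minimal Python sketch. It is not the pipeline used in this thesis: the stop-word list, the rule-based answer-type detector (standing in for a trained classifier), the bag-of-words cosine similarity (standing in for an IR model), and all function names are assumptions made for illustration, and the passages are toy English text.

```python
import re
from collections import Counter
from math import sqrt

# Illustrative stop-word list (assumption; a real system would use a fuller one).
STOPWORDS = {"the", "a", "an", "is", "of", "in", "at", "to", "for",
             "what", "who", "when", "where"}


def tokenize(text):
    """Lowercase, split into word tokens, and drop stop words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(count * cb[tok] for tok, count in ca.items())
    norm = (sqrt(sum(c * c for c in ca.values()))
            * sqrt(sum(c * c for c in cb.values())))
    return dot / norm if norm else 0.0


def process_question(question):
    """Step 1: formulate query terms and guess a coarse answer type (rule-based)."""
    first_word = question.strip().lower().split()[0]
    answer_type = {"who": "PERSON", "when": "DATE",
                   "where": "LOCATION"}.get(first_word, "OTHER")
    return tokenize(question), answer_type


def retrieve_passages(query_terms, passages, k=2):
    """Step 2: rank passages by similarity to the query and keep the top k."""
    ranked = sorted(passages,
                    key=lambda p: cosine(query_terms, tokenize(p)),
                    reverse=True)
    return ranked[:k]


def extract_answer(query_terms, passages):
    """Step 3: return the candidate sentence most similar to the query."""
    sentences = [s.strip()
                 for p in passages
                 for s in re.split(r"(?<=[.!?])\s+", p)
                 if s.strip()]
    return max(sentences, key=lambda s: cosine(query_terms, tokenize(s)))


def answer_question(question, passages, k=2):
    """Run the three steps in order and return (answer sentence, answer type)."""
    query_terms, answer_type = process_question(question)
    top_passages = retrieve_passages(query_terms, passages, k)
    return extract_answer(query_terms, top_passages), answer_type
```

Calling `answer_question(question, passages)` with a list of passage strings returns the best-matching sentence together with the guessed answer type. Each step is a separate function so that any one of them could be swapped for a stronger component, such as a neural retriever, without touching the others.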
Advanced deep neural models are often employed to enhance text-based QA by accurately matching questions with potential answers [44,45].

1.1.2 Large Language Models

LLMs are computational systems designed to understand and generate human language by leveraging statistical methods to predict word sequences or generate responses based on input [48]. These models achieve remarkable performance in tasks such as text generation, translation, and summarization due to their large-scale training on extensive datasets and their implementation of the Transformer architecture [49]. Central to this architecture is the self-attention mechanism, which enables efficient parallel processing and assigns varying importance to input tokens. This allows the model to capture long-range dependencies effectively [5]. Transformers have powered state-of-the-art language models like Google's BERT [6], Facebook's RoBERTa [50], and OpenAI's GPT-3 and GPT-4 [8,51].

The massive size of LLMs, billions to trillions of parameters (hence the name "Large"), is a critical factor in their performance. This scale enables LLMs to learn complex language patterns and develop sophisticated linguistic abilities [49]. Moreover, LLMs have demonstrated the emergence of novel capabilities, such as in-context learning [52]. This ability allows the models to adapt to specific tasks and generate contextually relevant responses, making them well-suited for a wide range of applications, including dialogue systems [48], step-by-step reasoning [53], and even the processing of non-textual input such as images and audio by Multi-modal Large Language Models, such as Google's PaLM [54].

However knowledgeable and plausible they sound, LLM-generated responses can be nonsensical or factually incorrect and therefore cannot always be trusted. This phenomenon is common in LLMs and is known as "hallucination" [55,56].
LLMs typically rely on the knowledge they learned during training and stored in their parameterized memory [15]. Hallucinations often arise when models attempt to fill gaps in knowledge by generating responses based on probabilistic associations rather than verifiable knowledge [24]. The hallucination problem becomes particularly concerning in specialized domains, such as legal or medical fields, where incorrect information can have serious consequences and can mislead users, undermining trust in LLM-generated responses. Even with their enhanced capabilities in understanding, reasoning, and generation, Multi-modal LLMs are not immune to generating hallucinated content that may appear plausible [57].

One established approach for interacting with LLMs is prompt engineering. This technique involves crafting specific textual prompts to guide the LLM's response generation process [62]. Users can steer the LLM's output towards desired outcomes and tasks by carefully designing these prompts.

1.1.3 Prompt Engineering

Prompt engineering enables LLMs to perform a wide array of tasks without requiring retraining or fine-tuning. By designing input prompts thoughtfully, practitioners can guide LLMs toward generating contextually relevant and accurate outputs, which enhances LLM performance [63].

Prompt engineering involves crafting specific instructions or queries (prompts) that encourage the model to generate responses aligned with the user's goal. Recent studies have expanded the landscape of prompt engineering by exploring methods such as zero-shot and few-shot prompting [60,61]. Zero-shot prompting allows LLMs to perform new tasks without any task-specific training by relying entirely on the model’s pre-trained knowledge base [61]. This technique is widely used for large models like GPT-3, where a well-structured prompt can enable the model to perform tasks it has never encountered before [51].
Few-shot prompting, on the other hand, improves the handling of more complex tasks by providing the model with a few example inputs and outputs, requiring only minimal additional input data [62]. A notable advancement in prompt engineering is Chain-of-Thought (CoT) prompting, which was introduced by Wei et al. in 2022 [53]. CoT prompts guide the model through logical steps, enhancing its ability to produce reasoned outputs, which makes it particularly useful for complex reasoning tasks such as mathematical problem solving and commonsense reasoning. Further improvements to this approach, such as Auto-CoT, automate the generation of reasoning chains, thereby enhancing robustness and reducing human effort in creating example-based prompts [63]. The introduction of role-prompting has also improved the specificity of model outputs by assigning a "role" to the model in the prompt, such as "acting as an expert" or "a friendly assistant". This helps guide the model towards more contextually appropriate and accurate responses in various domains [64]. Despite these advancements, challenges persist in optimizing prompts for more complex tasks due to the influence of multiple factors, including task complexity, model biases, and token limitations [65,66].

1.1.4 Retrieval Augmented Generation

RAG is a method that addresses factual hallucination and limitations in domain-specific knowledge for LLMs by incorporating external knowledge through information retrieval [32]. As illustrated in Figure 1, a traditional RAG-based QA system consists of three steps [23]:

1. Indexing: storing vector representations (embeddings) of text chunks. These embeddings allow efficient retrieval of relevant text based on similarity and are often stored in a vector database. Vector databases are specialized data management systems designed for storing, indexing, and querying high-dimensional vectors.
They support similarity search by using nearest-neighbor search algorithms, enabling efficient retrieval of semantically related data [67]. Unlike traditional relational databases, which organize data in tables with rows and columns and rely on structured query languages (SQL), vector databases handle unstructured data by representing it as numerical vectors. While regular databases perform exact-match searches, vector databases perform approximate nearest-neighbor searches.
2. Retrieval: identifying the most relevant text chunks for a given question. This step uses the same embedding model as the indexing step to find similar text based on the question embedding.
3. Generation: generating answers from the retrieved text segments using a language model.

The traditional RAG (also known as Naive RAG) adopts the Retrieve-Read method [68], which takes the user's query, matches it against the indexed documents, and then retrieves the k most relevant documents.

Figure 1 Traditional RAG Pipeline

The Naive RAG approach faces significant drawbacks across its retrieval, generation, and augmentation stages. During retrieval, it struggles with selecting relevant or well-aligned content and may fail to retrieve essential information, leading to misaligned (inaccurate) responses [76]. In the generation phase, the model often produces hallucinated or unsupported content, which compromises response quality and reliability. Additionally, irrelevance in generated outputs can further reduce effectiveness [19,69]. The augmentation process presents challenges in addressing coverage errors and ensuring that retrieved content aligns effectively with task-specific requirements [71]. Moreover, generation models can produce repetitive outputs or fail to integrate retrieved content meaningfully, resulting in responses that reflect surface-level repetition rather than deep contextual understanding [19].
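The three steps of the traditional pipeline described above (indexing, retrieval, generation) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the bag-of-words `embed` function stands in for a neural embedding model, the in-memory list stands in for a vector database, `generate` is a hypothetical placeholder that only assembles the prompt an LLM would receive, and the sample chunks are invented toy sentences.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): word counts instead of a neural embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Indexing: store each text chunk alongside its embedding."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=2):
    """Retrieval: embed the question with the same model and return the top-k chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def generate(question, context_chunks):
    """Generation: hypothetical placeholder that builds the augmented prompt only."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Invented toy chunks, used purely for illustration.
chunks = [
    "The library opens at eight in the morning.",
    "Tuition fees are paid at the start of each semester.",
    "Exam results are announced two weeks after finals.",
]
question = "When are exam results announced?"
index = build_index(chunks)
top = retrieve(index, question, k=1)
prompt = generate(question, top)
```

A production system would replace the exhaustive `sorted` scan with the approximate nearest-neighbor search of a vector database, and would send `prompt` to a language model rather than returning it.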
To address these drawbacks, several modifications have been introduced:

Index Structuring

A structural index enhances retrieval in RAG systems by organizing documents hierarchically, creating multilevel parent-child relationships among document chunks instead of indexing text chunks independently without any relationships or links between them. This method is called Hierarchical Structure Indexing [19], where each child document stores its parent document ID so that the entire parent document can be retrieved instead of the isolated text chunk, reducing issues associated with redundant or contextually disjointed chunks. Figure 2 illustrates a two-level Hierarchical Indexing for a single document.

Figure 2 Two-Level Hierarchical Structure Indexing for One Document

Another indexing structure method is Knowledge Graph Structure Indexing, where document chunks are represented as nodes and the relations between chunks are represented as edges. Adding a knowledge graph (KG) index further strengthens this structure by linking concepts and entities within the documents. This approach not only minimizes errors in retrieval but also translates the process into steps the LLM can interpret, leading to more accurate and contextually relevant responses. Methods like Knowledge Graph Prompting (KGP) [72] use KGs to represent document sections (such as pages or tables) as nodes and their connections as edges. This representation captures semantic relationships and enables coherent knowledge retrieval and reasoning across multiple documents [19].

Chunking Optimization

Chunking in RAG is essential for efficient and accurate query answering. It balances providing enough relevant context with minimizing irrelevant data, thus improving retrieval quality and computational performance [73]. Fixed chunk sizes are often used in RAG pipelines and can sometimes lead to insufficient or excessive information within each chunk.
Techniques such as recursive chunking and sliding windows address this issue by segmenting content based on natural language structures, like punctuation and sentence boundaries, or by overlapping chunks to preserve coherence [74]. For documents with clear structures, like financial reports, more advanced methods such as element-based chunking provide a tailored approach by using structural cues like titles and tables to create chunks, leading to more accurate retrieval [74]. While semantic and agentic chunking strategies provide improved contextual alignment, their increased complexity and computational requirements highlight the trade-off between retrieval accuracy and processing efficiency [73].

Metadata Attachment

Incorporating metadata into document chunks can enhance retrieval performance in RAG systems, particularly in multi-document contexts [75]. Attaching metadata such as page numbers, document titles, authorship information, timestamps, and other relevant identifiers allows precise filtering and prioritization of the most recent information, thereby improving retrieval relevance and minimizing the potential for confusion between similar chunks originating from distinct documents [76]. Furthermore, artificially constructed metadata, such as paragraph summaries or hypothetical questions generated by LLMs, bridges the semantic gap between user queries and document content, resulting in more accurate responses [75,76]. Metadata annotations also add contextual layers to each chunk, improving the RAG system's capacity to retrieve and present coherent information from diverse sources [19].

Re-ranking

Re-ranking models are essential in RAG systems, as they refine document retrieval by applying a secondary prioritization based on relevance (secondary to the similarity score) [19].
By reorganizing retrieved chunks, re-ranking ensures that the most relevant content appears at the top, optimizing the document pool that is provided to the language model [19,76]. This prioritization process can be rule-based, relying on metrics such as relevance, diversity, and mean reciprocal rank (MRR), or it can employ model-based methods leveraging advanced natural language processing models [19,76]. Re-ranking addresses the limitations of initial retrieval methods, which often prioritize similarity (e.g., through cosine similarity scores) without fully assessing relevance [19,76]. Advanced re-ranking models, like cross-encoders, are particularly effective in accurately scoring chunk relevance for a given query, often outperforming simpler bi-encoder models in this domain [19]. However, these approaches are computationally intensive, especially those that use pre-trained language models (PLMs) [19,77]. Generative LLMs, such as GPT-3, can further enhance re-ranking by generating synthetic queries for domain-specific training, enhancing the accuracy of the relevance-based ordering without requiring vast amounts of new labeled data [77]. Despite their performance, these re-ranking models are resource-intensive, emphasizing a trade-off between retrieval precision and computational cost [19].

Context Compression

Context compression is a technique in RAG systems to optimize performance and reduce inference costs. RAG systems retrieve relevant documents from an external datastore to augment a language model's response, but incorporating full documents as context can quickly lead to excessive token usage, exceeding the model's context length limits and increasing processing time [78]. Instead of simply concatenating numerous documents, context compression selectively simplifies information to minimize noise and highlight essential data, allowing the language model to focus on the most relevant content [78].
Several strategies for context compression have proven effective. One method employs small language models (SLMs) to filter out less important tokens in order to create a compressed prompt. Although the compressed result might seem disjointed to humans, it remains interpretable by LLMs and achieves compression without requiring further LLM training. Other methods train information extractors to identify relevant content within large documents [19]. Combining document reduction with context compression further improves model accuracy [19]. The "Filter-Re-ranker" paradigm, for example, uses SLMs as filters and LLMs as re-rankers to prioritize relevant content. LLMs can also evaluate and critique retrieved content before generating a response, discarding irrelevant information and focusing on key details. These compression techniques are crucial for balancing relevance, token limits, and processing costs in RAG systems while maintaining language integrity [19].

Query Expansion and Re-writing

In RAG systems, query expansion enhances retrieval accuracy by adding relevant context to the input query [19]. Hypothetical Document Embeddings (HyDE) achieves this by using LLMs to generate hypothetical contexts, which are then embedded with the original query to improve retrieval precision (Figure 3). This method is particularly beneficial when limited labeled data or explicit knowledge is available, enabling RAG systems to create richer embeddings by incorporating potentially relevant details [79].

Figure 3 Query Expansion Using HyDE

However, query expansion using LLMs, including methods like HyDE, can sometimes introduce inaccuracies or hallucinations, in which the hypothetical content diverges from factual information, especially if the LLM lacks knowledge of the query topic. To address this, multi-query approaches expand the initial query into several targeted queries, while sub-query methods break down complex questions into simpler prompts [19].
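To make the HyDE idea described above more concrete, the following toy sketch expands a question with a hypothetical answer before searching. Everything in it is an assumption made for illustration: `hypothetical_answer` is a hard-coded stand-in for an LLM call, `embed` is a word-count embedding rather than a neural model, and the chunks and question are invented. Following the description above, the hypothetical text is combined with the original question to form the query vector.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): word counts instead of a neural embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hypothetical_answer(question):
    """Hypothetical stand-in for an LLM that drafts a plausible (unverified) answer."""
    canned = {
        "How do I reset my portal password?":
            "To reset a forgotten password, open the login page "
            "and request a recovery email.",
    }
    return canned.get(question, "")


def hyde_query_vector(question):
    """Combine the question embedding with the hypothetical document embedding."""
    return embed(question) + embed(hypothetical_answer(question))


def search(query_vec, chunks, k=1):
    """Rank chunks by similarity to the query vector and return the top k."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Invented toy chunks: the first holds the answer, the second only looks similar.
chunks = [
    "Password recovery: request a recovery email from the login page.",
    "How do I contact the portal administrators about my schedule?",
]
question = "How do I reset my portal password?"
plain_hit = search(embed(question), chunks)
hyde_hit = search(hyde_query_vector(question), chunks)
```

In this toy setup the plain question matches the lexically similar but irrelevant second chunk, while the HyDE-style vector matches the password-recovery chunk. The same mechanism explains the risk noted above: if the drafted answer is wrong, the expanded query is pulled toward the wrong chunks.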
1.1.5 Agents in Large Language Models

LLM agents are designed to autonomously perform tasks by leveraging the models' reasoning abilities [38]. These agents can interpret complex instructions and execute multi-step processes, acting as intermediaries between users and computational resources [39]. Agents have demonstrated their adaptability and potential for transformative impact in various fields, including social sciences, natural sciences, and engineering [40]. However, challenges remain in maintaining consistent performance, ensuring accurate contextual understanding, and achieving seamless integration with external tools [38]. Ongoing research focuses on improving the agents' autonomy, reliability, and human-like interaction [80].

Agentic RAG leverages LLM agents to enhance information retrieval and overcome the limitations of traditional RAG systems [15] by incorporating intelligent agents that dynamically select and process relevant information, thereby improving response accuracy and contextual relevance [35]. This approach enables the system to handle complex, multi-step reasoning tasks more effectively than naive RAG, which lacks such dynamic decision-making capabilities [37]. Furthermore, by leveraging agents with access to various tools, Agentic RAG can route queries to specialized knowledge sources, leading to more accurate and contextually appropriate responses [36]. In contrast, naive RAG systems directly use retrieved information without additional processing, which may result in coherence issues in generated responses [15].

1.2 Literature Review

QA systems research began in the early 1960s, marking a significant area of study within natural language processing [81]. Early efforts often adopted rule-based methodologies, exemplified by the system proposed by Riloff and Thelen in 2000 [82].
However, their work [82] highlighted several inherent limitations of this approach, including the resource-intensive nature of rule creation, sensitivity to variations in wording and sentence structure, and limited ability to draw inferences. Further challenges arise from coreference resolution, contextual interpretation, and ambiguity management. Scalability is also a constraint, and errors from earlier processing stages can cascade, negatively impacting overall system performance. The manual crafting required by rule-based systems complicates their maintenance and updates, and their inability to generalize can lead to performance decline with larger datasets [83].

In the pre-transformer era, machine learning and NLP techniques were combined in various models to address the challenges of rule-based systems. Poon et al. [84] advanced machine reading by combining probabilistic reasoning with NLP to infer meaning from text, aiming to improve structured knowledge extraction from unstructured data using statistical methods. Lende and Raghuwanshi [85] similarly used NLP techniques like part-of-speech tagging, named entity recognition, and syntactic pattern matching to build question-answering systems for educational texts, focusing on interpreting and retrieving relevant information.

In addition to machine learning-driven methods, IR techniques were employed to extract relevant text segments from extensive document collections in order to enhance overall performance. Dwivedi and Singh [86] conducted a comprehensive review of question-answering systems showing how IR systems employed different techniques such as document indexing, keyword matching, and ranking algorithms to locate relevant information. The field was further advanced by the development of probabilistic IR models and the introduction of statistical methods for relevance estimation [87].
Early efforts, such as vector space models [88] and latent semantic analysis [89], laid the groundwork for more sophisticated retrieval systems. Despite their contributions, these early systems were often constrained by their reliance on complex NLP pipelines that required accurate linguistic annotations and handcrafted rules. This dependency not only increased the computational overhead but also limited their scalability and adaptability across diverse domains and languages, as emphasized by Lende and Raghuwanshi [85].

Since the introduction of the Transformer architecture [5] and of BERT [6], a few systems have been introduced as extractive question answering systems [90–93]. The methodologies [90–93] involved fine-tuning BERT or its variants (like RoBERTa) on domain-specific datasets, often with enhancements like hybrid architectures, semantic layers, or de-biasing strategies to address specific challenges in extractive QA. Most focus on leveraging transformer-based embeddings for efficient answer extraction.

A common set of challenges emerges across research on extractive question answering using BERT and its variants. Many models exhibit domain and dataset dependence, performing well only on datasets similar to those they are fine-tuned on, which limits their ability to generalize to diverse or unseen contexts [91,94]. Bias issues, such as position bias, cause models to rely overly on the location of answers within passages rather than their content [92]. Additionally, advanced architectures and mechanisms for improving performance often come with increased computational overhead, making them less practical for real-world applications [90]. Language limitations are significant, particularly in low-resource languages, where models struggle due to insufficient pre-training data and linguistic complexity [91].
Finally, models often face challenges with contextual understanding, such as handling complex sentence structures or reasoning across multiple sentences, reducing their effectiveness in answering complex queries [90,93]. A methodology proposed by Alkhurayyif and Sait [95] for Arabic question answering involves data preprocessing, named entity relationship identification using the Multinomial Naïve Bayes algorithm and Named Entity Recognition (NER), and response retrieval leveraging ELMo embeddings and Quaternion Long Short-Term Memory (QLSTM) networks. However, the system faces several challenges, including sensitivity to real-world variability, difficulty handling complex Arabic sentence structures, limited support for non-textual queries, and inefficiencies in training time. The complex morphology of Arabic, particularly verb-noun patterns, hinders contextual understanding, and scalability remains a concern due to its design being tailored for Arabic-specific tasks. Building on the foundational advancements of BERT, LLMs emerged as a transformative innovation in NLP, characterized by their unprecedented scale and capability [96]. These models significantly enhanced their ability to understand and generate contextually relevant text, achieving remarkable performance across a wide array of NLP tasks [97]. The adoption of transformer-based architectures, as outlined by Gillioz et al. [98], laid the foundation for these advancements, enabling LLMs to handle tasks such as machine translation, text summarization, and sentiment analysis with unprecedented accuracy and scalability. The vast pre-training of LLMs on diverse datasets across multiple domains has provided the models with extensive general knowledge, making them highly effective for open-domain applications. Kamalloo et al. [99] demonstrated LLMs' ability to provide contextually rich and accurate answers in open-domain question answering, leveraging their training on large datasets. 
However, despite their impressive capabilities, LLMs are prone to hallucinations, particularly in specialized and closed-domain scenarios [100]. Fine-tuning with domain-specific data has been shown to enhance performance by improving knowledge representation and accuracy within those domains. For instance, Guo and Hua [101] employed continuous training and instruction fine-tuning to adapt Meta's Llama 2 base models to the Chinese medical domain, achieving performance comparable to OpenAI's GPT-3.5-turbo in medical question answering. Similarly, Singhal et al. [102] proposed fine-tuning LLMs with specialized medical datasets to improve their ability to address expert-level medical inquiries accurately. Huang et al. [103] further demonstrated this principle by introducing "Lawyer LLaMA", a legal domain LLM that incorporates domain knowledge through continuous training and acquires professional skills via supervised fine-tuning, effectively adapting the model to legal-specific queries.

Although these fine-tuning approaches enhance domain accuracy, challenges remain, notably in supporting multilingual queries. Many LLMs are initially trained on primarily English datasets, which may limit their ability to handle languages with less training data [104]. The research by Xu et al. [105] discusses significant limitations in multilingual LLMs, such as language imbalance and multilingual alignment, which can degrade performance in low-resource languages. Additionally, LLMs require frequent updates to their knowledge base to keep up with rapid advancements and evolving information in specialized domains.
Since the knowledge in an LLM is "learned" within its parameters, updating the model's knowledge typically involves computationally expensive retraining or fine-tuning [72]. LLMs may also struggle to integrate new factual knowledge effectively, instead reinforcing their pre-existing knowledge, which can lead to increased hallucinations [106]. Chang et al. [107] highlight that adding multilingual data can improve low-resource language modeling performance, but as dataset sizes increase, adding more data may begin to decrease performance due to limited model capacity, a phenomenon known as the "curse of multilinguality" [107,108]. This limitation poses a significant challenge in making LLMs fully reliable and adaptable across diverse and dynamic specialized domains.

RAG has emerged as a promising alternative to traditional fine-tuning approaches for addressing the limitations of parameter-based language models [15]. While fine-tuning adapts a model by updating its parameters on specific datasets, it is often constrained by the static nature of learned knowledge and the risk of forgetting. In contrast, RAG dynamically integrates external knowledge sources, allowing models to access and leverage up-to-date information beyond their initial training data [15,19].

Several question-answering systems have been proposed using RAG, demonstrating its adaptability. For example, the English question-answering system DPR-RAG [109] integrates dense passage retrieval to enhance the accuracy of generated responses, offering significant improvements over traditional approaches. However, its reliance on dense embeddings can result in poor generalization to domains or languages with limited training data, and irrelevant passages retrieved during the process can degrade output quality.
A study focusing on Islamic question answering is MufassirQAS [110], which employs RAG to enhance Arabic question answering, particularly within the Islamic studies domain, by integrating a vector database of Turkish-translated Islamic texts with prompt engineering to semantically search for relevant information and provide it to the LLM. Nonetheless, the authors did not report any evaluation criteria, metrics, or scores reflecting the system's performance, although they mentioned that the system's effectiveness is limited when faced with larger contexts. Integrating and summarizing a large number of context chunks can disrupt coherence and the flow of information, resulting in fragmented responses. This challenge suggests a need for improvements in connecting and synthesizing retrieved data to produce cohesive outputs.

Existing research in QA systems demonstrates significant advancements but also reveals notable limitations. The MufassirQAS system [110] struggles with larger contexts, leading to fragmented responses and limited applicability to broader questions. Similarly, the Arabic QA system leveraging deep learning techniques [95] shows strong performance but suffers from scalability issues, inefficiencies in handling complex Arabic morphology, and a dependency on extensive training and fine-tuning processes. Lastly, there is a noticeable lack of generic QA systems that can seamlessly adapt to new knowledge bases without requiring significant reconfiguration or retraining, highlighting the limited flexibility and scalability of current approaches.

1.3 Problem Statement

LLMs like OpenAI's ChatGPT and Meta's Llama are often pre-trained on vast datasets of internet text, enabling the models to learn world information, grammatical phrasing, vocabulary, and linguistic context [111].
While these LLMs impress with their natural language abilities [9], they encounter distinct challenges in understanding queries and generating responses when applied to some specialized domains. These challenges stem from, among other factors, the breadth of the dataset the models were trained on, conflicting facts within that dataset, and outdated information [24]. They are further amplified when the LLM deals with complex languages like Arabic [112].

The phenomenon of generating nonsensical, inaccurate, or factually incorrect responses is known as "hallucination" [24]. In question answering for sensitive domains like legal counseling or Islamic fatwa, such inaccuracies are not tolerated. In Islamic fatwa, for instance, where interpretations of Qur’an and Hadith verses are crucial, even minor errors are unacceptable. Additionally, user queries may involve culturally specific and geographically dependent matters. For instance, one may ask about the permissibility of using a specific local bank or investing in a regional company. Such inquiries require a deep understanding of local culture, including events and institutions. While current LLMs trained on localized data show some cultural awareness, they often lack in-depth comprehension, hindering accuracy in specialized topics [113].

Moreover, LLMs like ChatGPT and Gemini often use content filtering to avoid sensitive, ethical, or potentially harmful topics [114]. While intended for responsible use, these filters can limit the scope and effectiveness of LLMs for specialized applications like Fatwa inquiries, where culturally specific responses are crucial.

Fine-tuning LLMs for specialized domains is the traditional solution. However, the enormous parameter counts of these models require significant computational resources for fine-tuning.
Robust LLMs capable of understanding text consist of billions of parameters: for example, OpenAI's GPT-3 has 175 billion parameters, Google's PaLM has 540 billion parameters [7], and the largest model in Meta's Llama 3.1 family has 405 billion parameters [115]. According to [116], training the GPT-3 davinci model took over 3 years and cost about 4 million USD for its 175B parameters, while OpenAI's GPT-4, which reportedly has trillions of parameters, took about 3 years and 6 months to train and cost about 90 million USD, even allowing for the more computationally powerful GPU chips that had become available.

Even for smaller models like Meta's Llama 3.1 8B, fine-tuning can be resource-intensive. Parameter-efficient fine-tuning (PEFT) techniques such as LoRA and adapter modules reduce the computational requirements of the fine-tuning process, yet the process can still be time-consuming on systems with limited resources and requires high-end hardware [117]. For instance, Llama 3.1 8B requires approximately 16 GB of GPU memory just to hold its weights in 16-bit precision [115], which fits on a high-end consumer GPU, while full fine-tuning needs additional memory for gradients and optimizer states. Techniques like LoRA and adapter modules reduce these memory requirements, making fine-tuning feasible on GPUs with lower memory capacities. However, these methods often introduce additional complexity due to the need for careful hyperparameter tuning, which can impact performance if not optimized [118]. Additionally, these approaches may struggle to generalize beyond the specific tasks or domains they were fine-tuned for, limiting their flexibility [117]. Parameter-efficient fine-tuning also typically retains the original base model's limitations, such as vulnerability to hallucination in contexts outside the fine-tuned scope [113].

Another concern in fine-tuning is that the structure of parameterized LLMs makes it difficult to predict performance before the fine-tuning process is completed [15]. More importantly, modifying learned data remains an ongoing research challenge [120].
This often requires complete re-tuning, as there is currently no established method for LLMs to "unlearn" outdated information effectively [121]. Additionally, parameterized-memory LLMs lack the inherent capability to provide references for their generated responses, which limits their reliability in producing verifiable outputs [15,122].

This study introduces a generic Arabic question-answering framework capable of answering questions across various domains and adapting to any textual dataset without requiring fine-tuning. The framework processes questions posed in Arabic and generates a response consisting of two key elements:

1. Answer: the generated answer to the user's question.
2. Evidence (if applicable): supporting references, which may include Qur'anic or Hadith verses in the context of Islamic fatwa, or relevant legislation for legal counseling. This component is provided only when the question requires evidence to support the generated answer.

1.4 Aims of Study

This study aims to introduce Nebras, a novel Arabic question-answering system designed exclusively for specialized domains. Nebras addresses the unique requirements of Arabic language applications while providing a scalable, efficient, and context-aware solution for generating accurate, factually correct answers. The system's design reflects several contributions that distinguish it from traditional systems.

One of the key contributions of Nebras is its adaptability. The system's knowledge base can be expanded and managed dynamically. Administrators can add new textual datasets tailored to specific domains, map relevant fields from the dataset for indexing and retrieval, and modify the knowledge base by updating or deleting existing documents as needed. This design eliminates the need for costly and time-intensive model fine-tuning and allows Nebras to seamlessly adapt to various domains while maintaining an up-to-date and relevant knowledge base.
Nebras employs a hybrid retrieval approach that combines multiple techniques to ensure relevant and accurate information retrieval. Its implementation leverages the Agentic RAG framework, where specialized agents collaborate to process tasks effectively.

Nebras addresses the unique challenges of the Arabic language by prioritizing linguistic and contextual precision. This study demonstrates its potential to advance question-answering for Arabic-speaking users in specialized domains by proposing a robust, scalable, and adaptable solution to meet diverse application needs.

1.5 Hypotheses of Study

The following hypotheses have been formulated to guide the research, evaluate the effectiveness of Nebras, and address the challenges identified in the problem statement. Nebras is a novel Agentic RAG-based system introduced in this study, designed for complex Arabic question answering across specialized domains.

1.5.1 Accuracy Hypothesis

Nebras is expected to outperform existing LLMs in providing accurate and contextually relevant answers to Arabic queries in specialized domains. Nebras enhances response reliability and reduces hallucinations (factually incorrect yet plausible-sounding answers) by dynamically integrating external textual knowledge bases. Nebras's performance will be compared to a baseline established by evaluating responses generated by several top-performing models, using both automatic metrics and expert evaluations on domain-specific datasets.

1.5.2 Adaptability and Scalability Hypothesis

Nebras's dynamic knowledge base management system allows seamless expansion and modification, which enables the system to incorporate new datasets for emerging domains without fine-tuning. By leveraging pre-trained LLMs, Nebras offers a scalable and cost-effective Arabic question-answering solution deployable even in resource-constrained environments.
This will be validated by analyzing Nebras’s computational efficiency (memory and processing requirements) and its scalability to large datasets.

1.5.3 Language-Specific Performance Hypothesis

Nebras is hypothesized to outperform existing systems in handling the linguistic complexities of Arabic (morphology, syntax, and dialectal diversity). Its design is expected to give it an advantage over general-purpose LLMs when processing complex Arabic queries. This will be evaluated using lexical and semantic metrics as well as contextual relevance ratings.

Chapter Two
Methods

This chapter details the development and implementation of Nebras, explaining the design choices, algorithms, and frameworks used to process Arabic queries across multiple domains. It covers the hybrid retrieval approach, Agentic RAG framework integration, use of pre-trained LLMs, and the system's dynamic knowledge base capabilities. Subsequent sections explore implementation aspects, highlight the key components, and clarify the system’s operation.

2.1 Data Collection

This research focuses on two distinct domains: Islamic Fatwas and University Information Help Desk. Each domain presents unique challenges and requires specific data structuring.

The Islamic Fatwa domain, due to its sensitive nature and the need for accurate, well-supported answers, demands careful handling. Fatwas require precise reasoning and the inclusion of correct and specific daleel (evidence) aligned with Islamic jurisprudence. This daleel, often from the Qur’an or Hadith, must be accurately referenced; the system must not generate or fabricate it. This makes the domain an ideal choice for testing the system’s ability to generate contextually accurate responses with the necessary supporting information. Additionally, this domain allows for the evaluation of the system’s performance with question-answer data structures, as fatwa data inherently follows this format.
Data for the Islamic Fatwa domain is sourced from two reputable websites:

1. NNU Fatwa Website (https://fatwa.najah.edu/): Maintained by the Faculty of Shari'a at An-Najah National University, a respected institution in Islamic scholarship, this website contains over 1,500 fatwas. These address local issues and inquiries specific to Palestine, making the data highly relevant and credible.
2. Islamweb.net (https://islamweb.net): Overseen by the Ministry of Endowments and Islamic Affairs in Qatar, Islamweb.net offers a vast repository of over 160,000 fatwas. Its credibility and reliability make it a valuable resource for this research.

These carefully selected, credible sources support a robust evaluation of the proposed system’s ability to handle diverse and complex Arabic queries, contributing to the development of Nebras, the proposed question-answering framework.

2.1.1 Islamic Fatwa Dataset Collection

To gather data from these sources, a custom scraping template is developed for Islamweb.net to capture specific fields essential for the system responses. These fields, as explained in Table 1 and highlighted in Appendix D - Figure D11, include categories, fatwa topic summaries, unique fatwa identifiers, dates, questions posed, and the detailed answers provided by muftis. Each entry is appended with a static "source" field with the value "islamweb.net" to facilitate data source tracking.

Table 1 Islamweb Fatwa Fields Mapping

| # | Field Description | Field Name | Data Type |
|---|---|---|---|
| 1 | Fatwa categories; can be extracted from fatwa breadcrumbs. | categories | array |
| 2 | Fatwa topic. A short summary of what the fatwa is about. | topic | string |
| 3 | Fatwa id. | fatwa_id | integer |
| 4 | Fatwa date in Gregorian and Hijri. | date | string |
| 5 | Fatwa question asked by a user. | question | string |
| 6 | Fatwa answer from muftis at Islamweb. | answer | string |

The initial scraping of Islamweb.net resulted in a dataset of 164,310 fatwas before pre-processing.
For the NNU Fatwa website, similar scraping techniques were used to obtain 1,500 locally relevant fatwas, enriching the dataset with inquiries specific to the Palestinian context.

2.1.2 University Help-desk Dataset Collection

The university help-desk dataset is collected exclusively from NNU and comprises information specific to NNU. In contrast to the fatwa data, the NNU dataset belongs to a different knowledge domain, focusing on academic and institutional information represented in a document-based, semi-structured format rather than a question-answer structure. This dataset includes factual information about majors, courses, admission requirements, and faculty members, allowing the evaluation of the proposed system's ability to retrieve and generate accurate responses from semi-structured documents.

Collecting data for the NNU dataset involves parsing web pages from the university's official site (https://najah.edu); the scraping techniques and categorization are explained in the subsequent subsections. However, the NNU website’s protection measures prevent automated data scraping, making it difficult to automatically gather information across all academic programs. Consequently, a manual data collection process is implemented for a selected group of medical and IT-related majors. Because manually gathering data for over 200 majors is time-intensive, this selective approach is considered necessary for the research.

Academic Data Collection

The process of collecting academic data focused on obtaining detailed information regarding the academic majors available at An-Najah National University at both the undergraduate and graduate levels. This data includes details for each major, covering its title, affiliated faculty, academic degree (e.g., Bachelor's, Master's, Doctorate), duration of study, and a corresponding URL.
The specific data fields extracted during the collection process are illustrated in Table 2 and highlighted in Appendix D - Figure D12.

Table 2 NNU Academic Majors Fields Mapping

| # | Field Description | Field Name | Data Type |
|---|---|---|---|
| 1 | Academic major title | major_title | string |
| 2 | Major's faculty | faculty | string |
| 3 | Academic degree | degree | Enum (bachelors, master, doctorate) |
| 4 | Major's study duration | duration | string |
| 5 | Major's info URL | url | url |
| 6 | Document type | doc_type | string |

The description of each academic major is displayed on its own dedicated page, necessitating a visit to the page for data extraction. For selected majors in medical and IT-related fields, as a sample, these academic major pages were manually visited to retrieve their descriptions (program listings: https://www.najah.edu/ar/academic/undergraduate-programs/ and https://www.najah.edu/ar/academic/graduate-programs/).

While visiting each academic major's page, information related to the major's curriculum and courses is collected. Table 3 shows information about the collected fields. The fields are also illustrated in Appendix D - Figure D13 for further reference. The "doc_type" field holds the document's category and is set manually while collecting the data. The values used to distinguish the documents are "academic_major", "academic_course", "staff", and "admission" for majors, courses, staff, and admission documents, respectively.
Table 3 NNU Academic Courses Fields Mapping

| # | Field Description | Field Name | Data Type |
|---|---|---|---|
| 1 | Plan version | version | string |
| 2 | Curriculum section | section | string |
| 3 | Course number | course_number | integer |
| 4 | Course title | title | string |
| 5 | Credit hours | credit_hours | string |
| 6 | Course prerequisites | prerequisites | string |
| 7 | Course description | description | string |
| 8 | Document type | doc_type | Enum (academic_major, academic_course, staff, admission) |

Administration-related Data Collection

The collection of administrative data is limited by the same protection measures mentioned earlier, so information regarding the university's governance and faculty deans is obtained manually. Some information regarding admission and acceptance is found on Nawarat An-Najah (https://nawarat.najah.edu/), an NNU subdomain for newly registered students.

By preparing these two datasets, the study can assess the QA system's performance in generating answers in different domains with different dataset structures.

2.2 Data Pre-processing and Structuring

Effective data structuring is essential for compatibility with the proposed system. Due to the distinct nature of the Islamic Fatwa and NNU datasets, each requires specific processing steps for organization and standardization. The following subsections detail the structuring methodologies for each dataset.

2.2.1 Islamic Fatwa Dataset Pre-processing and Structuring

This section details the cleaning and preprocessing steps applied to the Islamic fatwa data. These steps removed noise, inconsistencies, and irrelevant information to enhance the dataset's quality. The following subsections provide a comprehensive overview.

Missing Values and Exploratory Data Analysis

Data completeness is assessed by checking for missing values (NaN, whitespace-only values, or zeros in integer columns). The only column with missing values was the "topic" column, with 12 missing values.
To mitigate the impact of missing data on subsequent analysis, these missing values were replaced with human-generated descriptions derived from a careful analysis of the corresponding fatwa's context.

To gain insights into the textual content of the dataset, an exploratory data analysis (EDA) is conducted on the "topic", "question", and "answer" fields. The text in these fields is tokenized by splitting it into individual words on single-space delimiters, allowing for an analysis of token counts. Table 4 presents the maximum, minimum, and mean token counts for each field.

Table 4 Token Counts from the Scraped Islamweb Dataset

| Field | Max. Count | Min. Count | Mean |
|---|---|---|---|
| topic | 50 | 1 | 7.68 |
| question | 2,191 | 2 | 78.39 |
| answer | 6,076 | 2 | 211.44 |

The "question" and "answer" columns exhibited a minimum token count of 2, which raised concerns about the potential presence of very short or incomplete entries. Further inspection of these low-token-count entries was postponed until after the text cleaning process.

Muftis often begin fatwas with introductory sentences that do not contribute directly to the factual content or reasoning of the answer. The length of these sentences can influence the chunking process and potentially impact the retrieval process. To extract introductory sentences, the line breaks in the scraped fatwas are used. These sentences are generally found in the first line, although this pattern is not consistent across all fatwas. By examining the first lines, the frequent introductory sentences are extracted through the following steps:

− Splitting each fatwa by line breaks and the HTML <br> tag.
− Extracting the first non-empty text result from the split array, excluding empty HTML tags or whitespace.
− Calculating the frequency of each extracted sentence across the dataset.

This process identified a total of 17,034 unique introductory sentences. The sentence “الحمد لله والصلاة والسلام على رسول الله وعلى آله وصحبه أما بعد” (“Praise be to Allah, and peace and blessings be upon the Messenger of Allah, his family, and his companions. To proceed:”) appears most frequently in the entire dataset, with a total of 138,281 occurrences (84%). The second most frequent introductory sentence is “الحمد لله والصلاة والسلام على نبينا محمد وعلى آله وصحبه ومن والاه أما بعد” (“Praise be to Allah, and peace and blessings be upon our Prophet Muhammad, his family, his companions, and those who follow him. To proceed:”), which appears in 4,241 fatwas and has the same leading and trailing words.

Text Cleaning

The text cleaning process involved several steps: stripping HTML tags, removing non-Unicode characters, and eliminating duplicate tokens. Line breaks were preserved at this stage, as they might be beneficial in later processing steps.

Removing Introductory Sentences

Introductory sentences in fatwas do not contribute directly to the factual content or reasoning of the answer and usually follow a structural pattern. To remove them without affecting the context or valuable information, the process involves extracting the starting and trailing words from each introductory sentence to construct a pattern, identifying the first sentence in each fatwa, and removing it if it matches the pattern.

Further analysis of these words, as detailed in Table 5, reveals that “الحمد” (“praise”) is the most frequent starting word in introductory sentences (98.5%). The trailing words for sentences beginning with "الحمد" are also examined, with results presented in Table 5.
Table 5 Frequency of Start and Trailing Words in Introductory Sentences in Fatwa

| Word | Frequency | Percentage |
|---|---|---|
| Starting Words | | |
| الحمد (praise) | 160,870 | 98.5% |
| خلاصة (summary) | 2,262 | 1.39% |
| الخلاصة (the summary) | 89 | 0.054% |
| Trailing Words for Sentences Starting with “الحمد” | | |
| بعد (“to proceed”) | 84,580 | 51.8% |
| بعـد (same word, written with tatweel) | 58,699 | 35.9% |
| أعلم (knows best) | 13,814 | 8.45% |
| الفتوى (the fatwa) | 2,268 | 1.39% |
| وبعد (“and to proceed”) | 580 | 0.36% |

After removing the introductory sentences, the frequency of the starting words is recalculated and displayed in Table 6.

Table 6 Frequency of First Words After Removing Introductory Sentences in Fatwa

| Word | Frequency | Percentage |
|---|---|---|
| فإن | 27,234 | 16.67% |
| الحمد | 17,403 | 10.65% |
| فقد | 14,680 | 8.99% |
| فلا | 13,837 | 8.47% |

Although the word "الحمد" still appears relatively often, in these cases it is likely part of the fatwa answer itself; eliminating it completely could compromise the accuracy of the factual context within the fatwa.

Subsequent Text Cleaning and Re-evaluation

Following the removal of introductory sentences, the dataset is re-evaluated for missing values and token counts to assess the impact of the text cleaning steps. The evaluation showed no missing values in any dataset column. The results for token counts are presented in Table 7.

Table 7 Token Counts from the Scraped Islamweb Dataset After Cleaning

| Field | Max. Count Before Cleaning | Max. Count After Cleaning | Min. Count | Reduction |
|---|---|---|---|---|
| topic | 50 | 50 | 1 | 0 |
| question | 2,191 | 2,119 | 1 | 3.29% |
| answer | 6,076 | 5,436 | 3 | 10.53% |

A manual review of questions comprising only a single word revealed a typographical error in one query. The original, malformed question was:

قال.الله.ألاتزر.وازرة.وزر.أخرى.فكيف.يتفق.ذلك.مع.الشفاعة.ومع.إهداء.ثواب.الأعمال.إلى.الأموات؟
(“Allah said that no bearer of burdens shall bear the burden of another; how does that agree with intercession and with gifting the reward of deeds to the dead?”, with every space typed as a period)

Given the context and grammatical structure, it is clear that this query represents a user input error. Because the periods joined the whole question into a single token, the question was corrected by replacing the periods with spaces.
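The introductory-sentence extraction and removal steps described in this subsection can be sketched as follows. This is a minimal illustration, not the thesis's actual code: the function names (`first_sentence`, `intro_frequencies`, `strip_intro`), the exact regular expressions, and the punctuation stripped before matching are all assumptions. The start and trailing word sets would come from an analysis like Table 5.

```python
import re
from collections import Counter

# Split on newlines and HTML <br> tags (assumed to mark line breaks in scraped answers).
LINE_BREAK = re.compile(r"(?:<br\s*/?>|\r?\n)+", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def first_sentence(answer):
    """Return the first non-empty line of an answer, ignoring empty HTML tags."""
    for part in LINE_BREAK.split(answer):
        text = TAG.sub("", part).strip()
        if text:
            return text
    return ""


def intro_frequencies(answers):
    """Count how often each first line occurs across the dataset."""
    return Counter(first_sentence(a) for a in answers)


def strip_intro(answer, start_words, trailing_words):
    """Remove the first non-empty line only if it both starts and ends
    with known introductory words; otherwise return the answer unchanged."""
    parts = LINE_BREAK.split(answer)
    for i, part in enumerate(parts):
        text = TAG.sub("", part).strip()
        if not text:
            continue
        tokens = text.strip(" .:،").split()  # drop edge punctuation (assumed set)
        if tokens and tokens[0] in start_words and tokens[-1] in trailing_words:
            # Remaining lines are re-joined with plain newlines.
            return "\n".join(parts[i + 1:]).strip()
        break  # the first non-empty line is not an introduction
    return answer
```

Requiring both the starting and the trailing word to match is what keeps answers that merely begin with "الحمد" as part of their content (the cases counted in Table 6) from being truncated.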
For answers with fewer than 10 words, it was observed that some referenced other fatwas using phrases like:

− سبق برقم: 1640، ورقم: 10794. (“Previously addressed under no. 1640 and no. 10794.”)
− فتفصيل ذلك سبق في الفتوى رقم: 32689 فتراجع. (“The details were given in fatwa no. 32689; refer to it.”)

This suggests that extracting fatwa references from answer texts could be valuable. Regular expressions can be employed to identify these reference numbers. The references 1640, 10794, and 32689, mentioned in the short fatwa answers, were not found within the dataset and were subsequently removed.

HTML Tag Removal and Text Formatting

An analysis of randomly selected fatwas revealed that HTML tags were primarily used for text formatting and coloring, without a standardized structure for different fatwa components such as daleel (evidence) or references. This inconsistency made it impractical to leverage HTML formatting for information extraction.

Category Extraction

The final step in text pre-processing involved categorizing the fatwas to provide a structured taxonomy for subsequent analysis and retrieval. The category names were extracted from the fatwa breadcrumbs (Field 1 in the custom scraper template), and the word “الرئيسية” (main) is removed to streamline the categorization process. The first two words of the category names are retained to represent the main and secondary categories, respectively. This hierarchical categorization approach provides a more granular understanding of the fatwas' topics. The distribution of fatwas across the identified categories is visualized in Appendix E - Figure E14.

2.2.2 NNU Dataset Pre-processing and Structuring

The following subsections outline the structuring process for each NNU document type, including academic majors, academic courses, staff, and admissions.

Academic Majors Document Structuring

For academic major documents, additional data from other scraped fields is merged into the “description” field (as shown in Table 2) to create an entry suitable for indexing. These fields include “major_title”, “faculty”, “degree”, and “duration”.
Enriching the “description” with relevant information from these fields makes it more self-contained and informative. The resulting enriched text is stored in a newly created field, "content", specifically designed for indexing and similarity searches. The template for the "content" field is provided in Appendix F.

Staff and Admission Document Structuring

No augmentation is applied to these documents; their content is indexed directly in the vector database. Staff documents consist solely of staff member biographies, while admission documents outline the admission rules for new student enrollment.

Generating “topic” for NNU Documents

A "topic" field is added to each collected document, providing a brief summary of the document’s content. This summary is generated using an LLM with the following prompt:

“حلل المستند التالي واستخرج وصفًا قصيرًا للموضوع الذي يلخص الفكرة أو التركيز الرئيسي للمستند. يجب أن يكون الوصف موجزًا، ويفضل أن يتراوح بين 20-50 كلمة، ويعكس الموضوع الأساسي للنص. إليك المستند: {text}”

Which translates to:

“Analyze the following document and extract a brief description of the topic that summarizes the main idea or focus of the document. The description should be concise, ideally between 20-50 words, and should reflect the core subject of the text. Here is the document: {text}”

The “topic” field is also indexed in the vector database, resulting in each document having two embedding vectors: one for the “content” field and another for the “topic” field.

The data structuring process standardizes and optimizes both the Islamic fatwa and NNU datasets for integration with the QA system. By tailoring the structuring methods to the unique requirements of each dataset, this process ensures that the system can effectively handle both question-answer data and document-based datasets.
The structured datasets, with enriched content fields, facilitate accurate and efficient similarity searches and response generation, setting a solid foundation for the subsequent stages of system experimentation and evaluation.

2.3 Implementation

The following subsections delve into knowledge base preparation and the answer generation pipeline.

2.3.1 Vector Database

To ensure adaptability across various textual datasets, a mapping component is proposed for dataset preparation in the QA pipeline. Textual datasets serving as knowledge bases for question answering can be categorized into:

1. QA-based: Datasets comprising paired questions and answers.
2. Document-based: Datasets containing documents on specific topics (e.g., PDFs).

For QA-based datasets, essential fields include:

− Document ID: Unique identifier for each document.
− Question: The posed query.
− Answer: The corresponding correct answer.
− Topic: A concise description of the question's subject.

The “question”, “answer”, and “topic” fields are vectorized and indexed in the vector database, while additional fields serve as metadata.

For document-based datasets, required fields are:

− Document ID: Unique identifier for each document.
− Content: Full document content.
− Topic: A brief summary of the document's content.

The “content” and “topic” fields are vectorized and indexed, while any extra fields act as metadata.

Regarding the vector database, each collection must have the following metadata fields:

− Title: A human-readable collection name.
− Description: Summary of the collection’s content.
− Type: Dataset type ("qa" or "docs").
− ID Field: Field name for uniquely identifying documents.
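The field requirements listed above lend themselves to a small validation and mapping step. The sketch below is one possible shape for such a mapper, not the system's actual implementation: the function names (`validate_collection`, `map_record`), the `mapping` dictionary format (target field to source field), and the returned structure are assumptions made for illustration. The metadata keys (`title`, `description`, `type`, `id_field`) and the per-type indexed fields follow the lists above.

```python
from typing import Any

# Fields that are embedded and indexed, per dataset type.
INDEXED_FIELDS = {
    "qa": ("question", "answer", "topic"),
    "docs": ("content", "topic"),
}
# Metadata that every collection must declare.
REQUIRED_COLLECTION_KEYS = ("title", "description", "type", "id_field")


def validate_collection(meta: dict[str, Any]) -> None:
    """Raise ValueError if required collection metadata is missing or invalid."""
    missing = [k for k in REQUIRED_COLLECTION_KEYS if not meta.get(k)]
    if missing:
        raise ValueError(f"missing collection metadata: {missing}")
    if meta["type"] not in INDEXED_FIELDS:
        raise ValueError(f"unknown dataset type: {meta['type']!r}")


def map_record(raw: dict[str, Any], mapping: dict[str, str],
               meta: dict[str, Any]) -> dict[str, Any]:
    """Split a source record into an id, the fields to embed, and metadata.

    `mapping` maps each target field name to the source field that holds it
    (a hypothetical format for the user-supplied field mapping).
    """
    validate_collection(meta)
    required = (meta["id_field"], *INDEXED_FIELDS[meta["type"]])
    absent = [t for t in required if mapping.get(t) not in raw]
    if absent:
        raise ValueError(f"unmapped required fields: {absent}")
    used_sources = {mapping[t] for t in required}
    return {
        "id": raw[mapping[meta["id_field"]]],
        "indexed": {t: raw[mapping[t]] for t in INDEXED_FIELDS[meta["type"]]},
        # Everything not consumed above travels along as metadata.
        "metadata": {k: v for k, v in raw.items() if k not in used_sources},
    }
```

For an Islamweb record, for instance, the mapping would point the collection's ID field at `fatwa_id`, and the static `source` field would end up in the metadata.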
A collection’s metadata example:

    {
      "name": "nnu",
      "title": "An-Najah National University",
      "description": "This collection includes comprehensive documents about An-Najah National University, covering admission processes, academic programs, and general university details.",
      "type": "docs",
      "id_field": "document_id"
    }

By providing the necessary metadata, the system can efficiently manage and integrate new collections into its knowledge base. Users need only map relevant fields and provide collection metadata to incorporate datasets, as illustrated in Figure 4.

Figure 4 Dataset Field Mapper

2.3.2 Indexing and Chunking

During the indexing phase, documents are processed, segmented, and converted into embeddings, which are subsequently stored in a vector database. The quality of the index construction is critical in determining whether the appropriate context can be accurately retrieved during the retrieval phase [19]. The indexing structure introduced in this research utilizes a Hierarchical Index Structure with attached metadata. This hierarchical approach addresses the issue of context mismatch that arises when retrieved chunks are semantically incomplete.

Determining an optimal chunk size is a delicate process that requires balancing competing considerations. Chunks that are too long introduce noise to the embedding model and require more processing. Additionally, if a chunk exceeds the model’s maximum input length, it will be truncated, leading to loss of meaning. In contrast, chunks that are too short may prevent the embedding model from properly capturing the context. Incorporating a hierarchical index enhances retrieval and allows the model to reconstruct the context [15,19]. The process involves the following steps:

− Assigning a unique identifier to each document (if not already provided).
− Splitting the documents into smaller, fixed-size chunks.
− Mapping each chunk to its originating document, thereby creating a parent-child relationship (hierarchy).
− Attaching metadata to documents to enhance the filtering process and enrich the document’s content.

Achieving an appropriate chunk size involves a series of tests across different models and chunk sizes to find an effective balance, as detailed in the “Chunking Optimization” section.

2.3.3 QA Pipeline

Traditional RAG frameworks operate in two main steps: retrieving relevant documents and generating responses based on these documents [15]. While effective for straightforward queries, standard RAG often lacks the capacity for complex reasoning or task decomposition across multiple nodes [36]. Nebras’s QA pipeline is implemented using an Agentic Retrieval-Augmented Generation approach that leverages a graph-workflow framework for structuring nodes within the pipeline, where each node (or agent) has a very specific task. The framework organizes the workflow by enabling different agents and models at each step, each handling a discrete task, rather than dynamically retrieving graph-based data. The pipeline workflow is illustrated in Figure 5.

The pipeline utilizes a series of specialized agents to ensure the reliability and precision of the responses. The process begins with the Query Decomposition Agent, where complex questions are broken down into smaller, well-defined, and concise sub-questions. These are then processed by the Query Classifier Agent to determine the type of question being asked. Next, the Candidate Answer Generation agent formulates potential answers, which the Retriever then uses to identify the most relevant documents. The Context Relevancy agent assesses the retrieved information to ensure its applicability to the query.
Finally, the Answer Generation agent constructs the final response, and the Evidence Extraction agent supplements the answer with supporting evidence from the provided text (if any), ensuring the response is both accurate and well-supported by authoritative references.

Figure 5 QA Agentic RAG Pipeline

The following sections delve into each agent within the QA pipeline, outlining their individual roles.

Query Decomposition Agent

The query decomposition agent aims to decompose complex user queries into more focused sub-queries. This process involves breaking down the original query into smaller parts, rephrasing them for clarity and conciseness, and correcting any spelling or grammatical errors. The agent is specifically designed to rephrase user queries into a format that aligns with the system's knowledge domains. For instance, a query such as:

ما حكم الربا؟ كيف أقدم على تخصص علم الحاسوب؟ شحال الثمن ديال هاذ الكتاب؟ ما هي شروط القبول في الجامعة؟
(“What is the ruling on riba (usury)? How do I apply to the computer science major? How much does this book cost? [asked in Maghrebi dialect] What are the admission requirements at the university?”)

would be decomposed into the following sub-queries:

− ما حكم الربا؟ (What is the ruling on riba?)
− كيف أقدم على تخصص علم الحاسوب في جامعة النجاح الوطنية؟ (How do I apply to the computer science major at An-Najah National University?)
− ما هو سعر هذا الكتاب؟ (What is the price of this book?)
− ما هي شروط القبول في جامعة النجاح الوطنية؟ (What are the admission requirements at An-Najah National University?)

To keep the system dynamic, the title field from the collections’ metadata is passed to the prompt so that queries are decomposed accordingly. For example, if the system has two collections titled “Islamic Fatwa” and “An-Najah National University”, these knowledge domain topics are embedded in the prompt. The model is prompted with the Arabic prompt referenced in Appendix G. This process ensures that the generated sub-queries are semantically equivalent to the original query while being more suitable for subsequent agents.

Query Classifier Agent (Query Routing)

The classification agent classifies the decomposed queries based on their relevance to the knowledge base's specialized domains.
This agent uses the collections’ metadata (name, title, and description), which is retrieved, reformatted as a structured string, and included in the LLM prompt. The output is a JSON object pairing each decomposed question with its classified collection. The collection metadata fields and the string template used to build the classification prompt are provided in Appendix H.

Queries are classified as "irrelevant" if they do not align with the available domains. The classification results, stored as a JSON object with "question" and "collection" keys, are easily accessible to other processes. The classification agent serves a dual purpose: it isolates irrelevant queries and determines the most appropriate collection for retrieving relevant documents.

Candidate Answer Generation Agent

The Candidate Answer agent generates potential answers for relevant queries (those not filtered out by the Query Classifier Agent). It uses the HyDE technique for query expansion by generating an initial answer from a large language model without context. This initial, potentially inaccurate response helps retrieve more relevant documents. To maximize accuracy and relevance, the LLM is instructed to provide clear, concise answers to Islamic fatwa questions, excluding Hadith or Qur’an verses due to their sensitive nature. The prompt for candidate answer generation is referenced in Appendix I.

Retriever Agent

In the retrieval process, the decomposed relevant queries and the generated candidate answers are employed to retrieve relevant documents. In this research, the proposed hybrid retrieval approach integrates four similarity search techniques, each employing a Hierarchical Index structure. These methods retrieve the parent document along with its attached metadata. Recall that each document is represented by multiple embedding vectors to capture different aspects of its content. The similarity search techniques are:

1. query-to-answer (query -sim-> answer): similarity search between the query and the indexed answer.
2. HyDE (query + candidate answer -sim-> answer): the user’s query is combined with the candidate answer to form a new query for similarity search against the indexed answer.
3. query-to-question (query -sim-> question): similarity search between the query and the indexed question.
4. query-to-topic (query -sim-> topic): similarity search between the query and the indexed topic.

If the dataset is not question-answer based, the same techniques are used, but without the query-to-question approach. Also, instead of matching against the answer field, the similarity search is performed on the “content” field.

Each retriever returns the top-k most similar documents based on its retrieval criteria. The results from all retrievers are then combined into a single collection and filtered. This filtering process is guided by a ranking score, which is calculated using a ranking model, and by the document frequency within the combined collection. The relevance score for each document is computed as the sum of its ranking score and its normalized frequency score. The final selection involves choosing the top-k documents based on a predetermined similarity threshold applied to the relevance score. Figure 6 illustrates the retriever agent pipeline, with the final output consisting of an array of documents. These documents retain the same structure and fields as the original input documents, aligned according to the field mapping defined by the user.

Context Relevancy Agent

The Retriever Agent often returns multiple documents related to the input query. These documents may include sentences that are not directly relevant to the answer, potentially influencing the model’s response. To address this, the retrieved documents are passed to a long-context language model to identify and extract the most relevant sentences.
The prompt is referenced in Appendix J. The agent returns an array of the most relevant sentences related to the query, applying context compression so that the answer generation model focuses on the most relevant information.

Answer Generation Agent

The answer generation agent receives the decomposed relevant query along with the relevant sentences identified by the context relevancy agent. These inputs are integrated into a carefully crafted system prompt that defines strict guidelines for generating responses. The agent ensures that responses are clear, concise, and respectful, avoiding repetition of the query and maintaining a formal tone. When no context is available, it politely acknowledges its inability to provide an answer. This approach ensures that the agent delivers contextually relevant responses. The prompt for answer generation is provided in Appendix K.

Evidence Extraction Agent

The Evidence Extraction Agent focuses on extracting Islamic daleel for fatwa queries or legal references for legal queries. A prompt written in Arabic is passed to the LLM to identify and extract relevant Qur'anic verses, Hadith, scholarly references, and legal references (such as article numbers) that support the generated response. The Arabic prompt used is referenced in Appendix L.

In the QA pipeline, the workflow iteratively calls the agents from the Candidate Answer Generation Agent through the Evidence Extraction Agent until all relevant decomposed queries have been addressed.

Irrelevant Queries Response Agent

The Irrelevant Queries Response Agent is called if the user’s query is deemed irrelevant to the knowledge base. This agent informs the user about the system’s capabilities and provides guidance on formulating appropriate queries when the question concerns a topic beyond the system’s knowledge. By leveraging the collections’ title metadata stored in the vector store, the agent can effectively communicate the system’s areas of expertise.
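To make the control flow of this section concrete, the sketch below shows the per-query loop over the agents and the Retriever Agent's score fusion in plain Python. It is an illustration under stated assumptions rather than the thesis's implementation: the graph-workflow framework is replaced by ordinary function calls, the agents are passed in as hypothetical callables keyed by name, the frequency is normalized by the number of retrievers (the source does not specify the normalization), and the irrelevant-query agent is invoked only when no sub-query is relevant (the mixed case is not described in the text).

```python
from collections import Counter


def fuse(result_lists, rank_score, top_k=5, threshold=0.0):
    """Combine the document-ID lists returned by several retrievers.

    relevance = ranking score + normalized frequency, where frequency is the
    share of retrievers that returned the document (assumed normalization).
    """
    counts = Counter(doc_id for ids in result_lists for doc_id in dict.fromkeys(ids))
    n = max(len(result_lists), 1)
    relevance = {d: rank_score(d) + c / n for d, c in counts.items()}
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    return [d for d in ranked if relevance[d] >= threshold][:top_k]


def run_pipeline(user_query, agents):
    """Route a user query through the agents described in this section.

    `agents` is a dict of callables (hypothetical interface):
    decompose, classify, candidate, retrieve, relevancy, answer,
    evidence, irrelevant.
    """
    sub_queries = agents["decompose"](user_query)
    # classify returns a list of {"question": ..., "collection": ...} objects.
    routed = agents["classify"](sub_queries)
    relevant = [r for r in routed if r["collection"] != "irrelevant"]
    if not relevant:
        reply = agents["irrelevant"](user_query)
        return [{"question": user_query, "answer": reply, "evidence": None}]

    results = []
    for item in relevant:  # Candidate Answer Generation through Evidence Extraction
        question, collection = item["question"], item["collection"]
        candidate = agents["candidate"](question)           # HyDE draft answer
        docs = agents["retrieve"](question, candidate, collection)
        sentences = agents["relevancy"](question, docs)     # context compression
        answer = agents["answer"](question, sentences)
        evidence = agents["evidence"](answer, sentences)
        results.append({"question": question, "answer": answer, "evidence": evidence})
    return results
```

In this arrangement `fuse` would sit inside the `retrieve` callable, merging the outputs of the four similarity searches before the parent documents are passed on.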
To ensure a respectful and informative interaction, the agent is designed to respond politely to inappropriate user input, such as profanity or harmful questions. While safeguard models were tested (like Meta’s Llama Guard 3) to detect harmful content, challenges were encountered in identifying profane words, particularly in Arabic. Additionally, certain questions related to Islamic fatwa, such as marital issues, were mistakenly clas