An-Najah National University
Faculty of Graduate Studies

A HYBRID DEEP LEARNING MODEL FOR FORECASTING PM2.5 AIR POLLUTANT CONCENTRATIONS

By
Asma Mousa (Mohammad Ali) Massad

Supervisors
Dr. Anas Toma
Dr. Abdelhaleem Khader

This Thesis is Submitted in Partial Fulfillment of the Requirements for the Degree of Master in Artificial Intelligence, Faculty of Graduate Studies, An-Najah National University, Nablus - Palestine.

2024

List of Contents

Declaration
List of Contents
List of Tables
List of Figures
List of Appendices
Abstract
Chapter One: Introduction
1.1 Theoretical Basis
1.2 Related work
1.3 Problem Statement and Study Objectives
1.4 Study Hypothesis
1.5 Importance of the Study
Chapter Two: Foundational Concepts
2.1 Attention Mechanisms
2.2 Graph Attention Networks
2.3 Time-series Transformers
Chapter Three: Methodology
3.1 Data Description and Preprocessing
3.1.1 Data Analysis
3.1.2 Data Preprocessing
3.1.3 Features Engineering
3.2 Model Architecture
3.2.1 DyGAT
3.2.2 Informer
Chapter Four: Experimental Design and Setup
4.1 Baseline Models
4.2 Experimental Setup
4.2.1 Model Training and Optimization
4.2.2 Model Evaluation and Validation
Chapter Five: Results and Discussion
5.1 Beijing Dataset
5.1.1 DyGAT Performance Analysis
5.1.2 Spatiotemporal vs Temporal Forecasting
5.1.3 Comparative Analysis with Baseline Models (Beijing Dataset)
5.2 Case Study - Nablus Dataset
5.2.1 Impact of Contextual Features on Forecasting Accuracy
5.2.2 Comparative Analysis with Baseline Models (Nablus Dataset)
Chapter Six: Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
List of Abbreviations
References
Appendices
Abstract in Arabic (الملخص)

List of Tables

Table 1: Datasets Properties
Table 2: Statistical Summary for Numerical Features in Beijing Dataset
Table 3: Statistical Summary for Numerical Features in Nablus Dataset
Table 4: Models Training Parameters
Table 5: Models Hyper-parameters after Tuning
Table 6: Evaluation Results of DyGAT-Informer vs. Informer for Beijing Stations (24-Hour Forecasting)
Table 7: Evaluation Results for Baseline Models over Different Forecasting Windows (Beijing Dataset)
Table 8: Evaluation Results of DyGAT-LSTM and LSTM Models with and without GIS Data (Nablus Dataset)
Table 9: Statistical Significance of Performance Differences between Models (Nablus Dataset)
Table 10: Evaluation Results of Baseline Models (Nablus Dataset)

List of Figures

Figure 1: Average PM2.5 Concentration by Month for Each Year for Beijing Dataset
Figure 2: The Boxplots of the PM2.5 Values at Each Station in the Beijing Dataset
Figure 3: The Correlation Matrix of the Numerical Features in the Beijing Dataset
Figure 4: Wind Direction Degree Mapping Chart
Figure 5: Overall Architecture of the DyGAT-Informer Model
Figure 6: DyGAT Architecture. (a) Overall Architecture of the DyGAT. (b) Detailed Architecture of a Dynamic GAT Layer.
Figure 7: Actual vs Predicted Values by DyGAT-Informer Model (Beijing Dataset)
Figure 8: DyGAT Attention Weights Visualization. (a) Nodes Attention Weights at 3 Different Time-steps. (b) Edge Attention Weights at 3 Different Time-steps. (Beijing Dataset)
Figure 9: Evaluation Scores for the Models over Different Forecasting Windows for Beijing Dataset. (a) MAE, (b) RMSE, (c) SMAPE
Figure 10: Actual vs Predicted Values by DyGAT-LSTM(GIS) Model (Nablus Dataset)
Figure 11: Evaluation Scores for the Models over Different Forecasting Windows for Nablus Dataset. (a) MAE, (b) RMSE, (c) SMAPE

List of Appendices

Appendix A
Figure A.1: The Architecture of the Informer
Figure A.2: ST-GAT Architecture. (a) The Overall Architecture of the ST-GAT.
Figure A.3: The Architecture of the ST-GCN
Figure A.4: Attention Weights Visualization from a Different Batch of Data. (a) Nodes Attention Weights. (b) Edges Attention Weights.
Figure A.5: Actual vs Predicted Values for Baseline Models at One Station (Beijing Dataset)
Figure A.6: Actual vs Predicted Values for Three Baseline Models (Nablus Dataset)
Figure A.7: Actual vs Predicted Values for DyGAT-Informer Using “Learned” Temporal Embedding (Nablus Dataset)
Figure A.8: Actual vs Predicted Values for DyGAT-LSTM and DyGAT-Autoformer (Nablus Dataset)

Abstract

Air quality forecasting is a crucial research field that aids scientists and policymakers in making informed decisions to combat air pollution. Among various pollutants, PM2.5 (particulate matter with a diameter smaller than 2.5 micrometers) poses significant health risks, as it can reach the lower respiratory tract and enter the bloodstream. Accurately forecasting PM2.5 levels is thus essential. Although machine learning-based spatiotemporal forecasting models have advanced, the pursuit of more accurate forecasts continues. The use of hybrid deep learning models for PM2.5 forecasting is a promising and active area of research, as these models aim to capture complex spatiotemporal dependencies more effectively.

We developed a Dynamic Graph Attention Network (DyGAT) to model spatial dependencies. DyGAT leverages engineered edge features, including distance, wind speed, and wind direction, while using attention mechanisms to capture the dynamic nature of these dependencies. DyGAT was then combined with Informer, a Transformer for efficient time-series forecasting, to capture spatial and temporal patterns jointly and improve prediction accuracy. Our model was evaluated on a benchmark dataset from Beijing, with 420,768 records over four years. DyGAT-Informer outperformed a version without the DyGAT component and other baseline models.
It achieved an MAE of 50.43, an RMSE of 79.9, and a SMAPE of 28.88%, compared with an MAE of 51.44, an RMSE of 80.83, and a SMAPE of 30.25% for the next best model.

Additionally, we conducted a case study using a dataset from Nablus, Palestine, consisting of 2,692 records per station over a two-month period. We incorporated geospatial features about nearby pollution sources into the dataset. Because the Nablus dataset has too few records to train the Informer, the Informer was replaced with a sequence-to-sequence Long Short-Term Memory (LSTM) model. DyGAT-LSTM, trained with additional geospatial features about nearby pollution sources, achieved a 2.08% reduction in MAE, 1.17% in RMSE, and 1.96% in SMAPE. This confirms the benefit of incorporating such data. Finally, despite the short distances between stations, DyGAT successfully captured spatial dependencies: DyGAT-LSTM achieved reductions of 3.13% in MAE, 1.48% in RMSE, and 3.67% in SMAPE compared with the LSTM-only model.

Keywords: Spatiotemporal Forecasting, Air Quality, PM2.5 Forecasting, Hybrid Deep Learning Model, Graph Attention Networks, Transformers.

Chapter One
Introduction

Air pollution is one of the existential threats to humans in modern societies. According to the World Health Organization (WHO), in 2019 99% of the world's population lived in places where air pollution levels exceeded the WHO air quality guidelines [1]. There are many types of air pollutants; the most common are Particulate Matter (PM), Carbon monoxide (CO), Ozone (O3), Sulphur dioxide (SO2), and Nitrogen dioxide (NO2). Each of these pollutants has its own effects on human health and on the environment [2]. In general, air pollutants can affect different systems and organs in the human body; they can cause cancers (particularly lung cancer), and they are correlated with decreased cancer survival rates [3].
Additionally, exposure to air pollutants is linked to chronic respiratory and cardiovascular diseases [4]. Air pollution is linked to one in nine global deaths and seven million annual premature deaths [5].

Particulate Matter (PM) is one of the prominent air pollutants that directly affect human health [6]. PM is a generic term for a variety of pollutants suspended in the air that may differ in their chemical and physical properties. PM particles are usually classified according to their aerodynamic diameter; the two common classes are PM10 and PM2.5. PM10 refers to particles with an aerodynamic diameter of less than 10 µm, while PM2.5 refers to particles with an aerodynamic diameter of less than 2.5 µm [2]. While all classes of PM pollutants can pose health risks, finer particles like PM2.5 are more hazardous [4]. Exposure to fine particles like PM2.5 worsens illnesses such as asthma, cancer, and diabetes, while also harming mental health and cognitive development [5].

General examples of air pollutant sources are industrial emissions, fuel combustion, and photochemical reactions from plants. Agricultural and domestic use of herbicides and insecticides is another source of air pollutants [2].

To deal with the hazards of air pollutants like PM2.5, governments and policymakers need models that can predict the concentration levels of these pollutants in order to plan future environmental policies and issue alerts to the population when PM2.5 levels are high.

1.1 Theoretical Basis

Forecasting PM2.5 concentrations is crucial due to the significant health risks associated with exposure to fine particulate matter. Accurate forecasts enable timely public health warnings, allowing individuals to take preventive measures during high pollution periods. Moreover, reliable forecasts assist policymakers in implementing effective air pollution control strategies, thereby improving overall air quality [7].
Additionally, air pollution control strategies are typically long-term and costly. Forecasting models for air pollutants, specifically PM2.5, can therefore be used to assess the effectiveness of these measures by modeling trends and investigating the impact of applied interventions and control strategies [8].

PM2.5 forecasting models can be divided into traditional approaches and Machine Learning (ML) based approaches. Among traditional approaches, the most commonly used are Chemical Transport Models (CTMs) and statistical models. However, these approaches face several difficulties: the complexity of atmospheric processes, the oversimplification of the underlying processes that produce PM2.5, the lack of adaptability (since these models are usually built for a specific region), and their inability to capture the nonlinear behavior of, and interactions between, the components of air [9].

Owing to recent advances in relatively cheap and reliable measurement instruments, a large amount of historical air quality data has become available. Additionally, computational power has increased rapidly in the last few years, which has encouraged researchers to explore a variety of machine learning architectures that use historical data to overcome some limitations of traditional methods, such as their difficulty in modeling nonlinear relations and their oversimplification of the atmospheric processes that affect air quality [9] [10].

1.2 Related work

Machine learning models are becoming widely used tools for predicting the air quality index and the concentrations of specific pollutants, such as PM2.5. A wide variety of traditional machine learning algorithms and deep learning models have been built to forecast air quality and pollutant concentrations; each model has its own features, strengths, and weaknesses. The choice of a forecasting model depends on the available data and the purpose of the research [7] [9].
Machine learning approaches are divided into traditional ML and Deep Learning (DL) models. A recent survey [7] that reviewed machine learning models for air quality forecasting over the last ten years found that both traditional ML and DL models are used; however, DL-based models are becoming more prominent in recent studies. Deep learning models are capable of modeling high-dimensional data and nonlinear relations [9]. The DL models used in the literature for PM2.5 forecasting were either variations of a single DL model or hybrid models that combine more than one model to deal with different aspects of the data [7].

For traditional ML models, Support Vector Machine (SVM) [11] and Random Forests (RF) [12] were the two most used algorithms. For DL, the most common models are the Multi-Layer Perceptron (MLP) and the Long Short-Term Memory (LSTM) neural network, which is a variant of the Recurrent Neural Network (RNN), followed by the Convolutional Neural Network (CNN) [7].

Yao et al. [13] used an Artificial Neural Network (ANN) to forecast daily PM2.5 levels. They used ground monitoring data and incorporated other sources such as satellite images; compared with traditional multiple regression models, their model's predictions were more accurate. Agarwal et al. [14] developed an ANN-based model to forecast several air pollutants, including PM2.5 and PM10. They provided daily forecasts as well as longer forecasts extending up to four days. Their data included meteorological features and hourly pollution levels, and their model included a dynamic real-time correction feature that adjusts current predictions according to the model’s performance on previous days. These two studies [13][14] primarily focused on improving the prediction accuracy of traditional MLP networks without introducing significant changes to the architecture of the networks themselves.
In contrast, other studies introduced more novel modifications to the MLP architecture to address different aspects of the forecasting process or to achieve specific goals. One study [15] improved the performance of the MLP by incorporating a Kalman filter into the learning algorithm to adapt to the time variation of the air quality system. Another work [16] added a rolling mechanism and a gray model, which was used to preprocess the meteorological data in order to reduce complexity. Other modifications included a hybrid model combining an MLP and linear regression [17] to improve the accuracy of PM2.5 concentration forecasts; in that model, the regression component enhances the reliability of the predictions by correcting bias in the neural network's output.

LSTM is another widely used method for PM2.5 forecasting due to its ability to handle time-series data effectively. Zhou et al. [18] experimented with different variations of LSTM, including shallow and deep LSTMs, while another work [19] built a hybrid LSTM-Kalman model, which performed better than the classical LSTM. One study [20] built an LSTM-based model called Multi-output and Multi-index of Supervised Learning (MMSL); it is a spatiotemporal model that builds predictions for one location by incorporating data from neighboring monitoring stations.

One variation of LSTM is the Bidirectional LSTM (Bi-LSTM), which keeps information from both the past and the future. Madaan et al. [21] built a Bi-LSTM-based model with an adaptive attention mechanism. Other works have used transfer learning with Bi-LSTM. For example, one study [22] used transfer learning with Bi-LSTM to transfer the knowledge learned at small temporal resolutions into a model that works at larger temporal resolutions. Another study [23] used transfer learning with Bi-LSTM to predict PM2.5 at new stations that have no data. Bi-LSTM was also used in spatiotemporal models [24]. Zhang et al.
[25] created a model that combines Bi-LSTM with Empirical Mode Decomposition (EMD) to predict PM2.5 values. Their model used only historical PM2.5 data, without any meteorological data: historical PM2.5 readings were treated as an input signal, and the EMD was used as a semi-supervised learning algorithm that extracts the hidden frequency features. Their model was developed to enhance short-term predictions, especially when sudden changes are present.

Other studies used sequence-to-sequence (Seq2Seq) architectures, where both the input and the output are sequences; Encoder-Decoder models are an example of Seq2Seq models. One study [26] built an Encoder-Decoder model with LSTM; their model showed significant results for long-term forecasting of PM2.5. Another study [27] used an Encoder-Decoder for PM2.5 prediction along with a Genetic Algorithm for feature selection and outlier removal to enhance forecasting accuracy.

Researchers also experimented with hybrid deep learning models to overcome some of the limitations of single-algorithm models. These hybrid models have become more popular recently due to the rapid increase in computational power. One popular hybrid model is a combination of LSTM and CNN. In some studies [28] [29] that developed hybrid CNN-LSTM models, the CNN was used to extract features related to air quality while the LSTM was used to model the historical process of the time-series data. Le et al. [30], in contrast, developed a CNN-LSTM model in which the combination was used to handle the spatial and temporal features of their data, which included traffic volume data along with the PM2.5 and meteorological data. Zhang et al. [31] built a hybrid model that combines Variational Mode Decomposition (VMD) with Bi-LSTM. The VMD was used to decompose the original time-series signal into multiple sub-signals in the frequency domain; their work performed better than models that used EMD for signal decomposition.
Another way to utilize hybrid models is to build powerful spatiotemporal models by using Graph Neural Networks (GNNs) to model the spatial relations, along with another deep learning model to handle the temporal dependency in the data. Using LSTMs with GNNs is a popular combination for hybrid spatiotemporal forecasting models. Some studies used a variant of GNN called the Graph Convolutional Network (GCN) with LSTM. Qi et al. [32] employed a GCN to capture spatial dependencies in the data and an LSTM network to model temporal dependencies; the final forecasts are generated by passing the output through a Fully Connected Network (FCN). To construct the weighted adjacency matrix, they used a formula that calculates the spatial distance between stations. In their approach, nodes are considered connected only if the distance is within 200 km; otherwise, the adjacency matrix entry is set to zero. Teng et al. [33] used a GCN-LSTM architecture similar to that of the previous study; however, they incorporated Aerosol Optical Depth (AOD) data to improve the accuracy of PM2.5 forecasts. AOD is a measure of how much aerosol is present in a column of the atmosphere. Another study [34] created a hybrid model using a GCN with self-loops and an LSTM with a temporal sliding window to forecast multiple air pollutants. The temporal sliding window moves over the time-series data to generate overlapping sequences, which are then used to train the LSTM network.

The previous studies [32], [33], [34] used a sequential architecture, where the output of the GCN is used as input to the LSTM. However, a study by Gao et al. [35] used a parallel integration method between GCNs and Bi-LSTM, where the outputs of the two models are concatenated and passed to an FCN to create forecasts. In addition, one study [36] used a GCN with the Gated Recurrent Unit (GRU), which is another RNN variant.
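As an illustration of two preprocessing steps mentioned above, the distance-thresholded adjacency matrix and the temporal sliding window, the following is a minimal sketch and not the implementation of any cited study. The function names, the haversine distance, and the inverse-distance weighting are assumptions made for the example; the exact weighting formula used in [32] is not reproduced here. Only the 200 km cut-off and the idea of overlapping input/target windows come from the text.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def thresholded_adjacency(coords, max_km=200.0):
    """Build a weighted adjacency matrix from (lat, lon) station coordinates.

    Stations farther apart than max_km get a zero entry. The inverse-distance
    weight for connected stations is an assumption of this sketch.
    """
    coords = np.asarray(coords, dtype=float)
    lat = coords[:, 0][:, None]  # shape (N, 1)
    lon = coords[:, 1][:, None]
    dist = haversine_km(lat, lon, lat.T, lon.T)  # broadcasts to (N, N)
    weights = np.zeros_like(dist)
    connected = (dist > 0) & (dist <= max_km)  # no self-loops here
    weights[connected] = 1.0 / dist[connected]
    return weights


def sliding_windows(series, input_len, horizon):
    """Split a 1-D series into overlapping (input, target) training pairs."""
    series = np.asarray(series, dtype=float)
    n = len(series) - input_len - horizon + 1
    if n <= 0:
        raise ValueError("series is too short for the requested window sizes")
    x = np.stack([series[i:i + input_len] for i in range(n)])
    y = np.stack([series[i + input_len:i + input_len + horizon] for i in range(n)])
    return x, y


# Hypothetical coordinates, chosen only to show one near pair and one far station.
stations = [(39.9, 116.4), (39.1, 117.2), (31.2, 121.5)]
adjacency = thresholded_adjacency(stations, max_km=200.0)
inputs, targets = sliding_windows(np.arange(10), input_len=4, horizon=2)
```

Consecutive windows here are shifted by one time step, so they share most of their values; a larger stride would reduce the overlap at the cost of fewer training samples.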
Some studies [37] used another GNN variant called the Graph Attention Network (GAT) [38] with LSTM, and another combined GAT with GRU [39]. Zhou et al. [40] introduced another GNN-based hybrid model, in which they used a GCN for spatial dependencies and temporal convolution to obtain temporal features. Their model used wind-field diffusion distance, instead of the typical Euclidean distance, to describe the relation between each pair of nodes.

Other deep learning algorithms have also been used to build PM2.5 forecasting models, such as the Deep Belief Network (DBN) [41] and the Autoencoder Neural Network [42].

After the introduction of Transformers [43], many researchers started experimenting with variations of Transformer architectures and hybrid models that include them. Liang et al. [44] built a Transformer-based PM2.5 prediction model; according to their results, the model was able to predict air quality with a fine spatial granularity that had not been achieved before. Al-qaness et al. [45] built ‘ResInformer’, a model based on the Informer [46], a Transformer variant introduced to improve the inference speed of long-sequence predictions. Their work improved the attention distillation block in the Informer. Ma et al. [47] used the Informer in a spatiotemporal PM2.5 forecasting model; the key innovation is a spatiotemporal embedding layer added to the Informer to model the spatial dependencies in the data. MSAFormer [48] is another Transformer-based PM2.5 forecasting model; it uses a Transformer with sparse autoencoding to extract the most important features from the vast amount of multi-site meteorological data, focusing on the information most relevant to PM2.5 prediction. Zhang et al.
[49] built an Encoder-Decoder PM2.5 prediction model with Sparse attention-based Transformer Networks (STN). They used the sparse attention approach to reduce time complexity; their results show that the model has low time complexity and outperformed state-of-the-art models.

Other studies used Transformers in hybrid models. One study [50] used the Informer with a GCN, aiming to improve air quality forecasting by capturing the dynamic and intricate relationships between air pollutants and their environment. Graph Transformers are another hybrid concept introduced in the literature [51]; this hybrid model has been used for many tasks, including time-series forecasting. Li et al. [52] introduced ‘Forecaster’, a spatiotemporal forecasting model based on graph Transformers. Graph Transformers were also used in building PM2.5 forecasting models. One study [53] introduced Temporal Difference based Graph Transformer Networks (TDGTN), which use temporal difference techniques to learn long-term dependencies in PM2.5 concentration data.

Although spatiotemporal forecasting models have recently been investigated and used to forecast PM2.5 concentrations, many existing approaches do not fully address the dynamic nature of spatial relationships among different locations. Traditional models and hybrid models have demonstrated success in capturing time-dependent patterns and spatial correlations. However, they often rely on static spatial representations, usually governed by the geolocation of the monitoring stations, which overlook the fact that spatial dependencies between stations can change over time due to factors like weather conditions that affect the dispersion of pollutants such as PM2.5. Moreover, while some models do integrate meteorological data into spatial dependency modeling, they typically do not account for the directionality of pollutant dispersion between each pair of monitoring stations.
1.3 Problem Statement and Study Objectives

Despite recent advancements in PM2.5 forecasting models, there is still significant potential to improve how these models capture the complex and non-linear spatiotemporal patterns of PM2.5 concentrations. Moreover, the integration of geospatial data, such as geographical information about pollution sources, into deep learning models remains an underexplored area.

The primary objective of this study is to develop a hybrid spatiotemporal PM2.5 forecasting model that addresses the dynamic spatial dependencies between locations and incorporates domain knowledge about pollutant dispersion. This hybrid model consists of two parts: a dynamic variant of GAT that we have developed to capture the dynamic spatial dependencies, and the Informer [46], a Transformer designed for long-range time-series forecasting, to model temporal dependencies. Unlike existing approaches that rely on static spatial connections, our DyGAT model introduces an attention-based dynamic adjacency matrix that evolves over time, reflecting changing patterns in PM2.5 concentrations. In addition, we engineered directional edge features based on geolocation, wind direction, and wind speed. These edge features highlight important factors that affect pollutant dispersion across the region and provide directionality to the spatial relationships. The model will be trained and evaluated using the benchmark dataset ‘Beijing Multi-Site Air-Quality Dataset’ [54], which provides comprehensive air quality measurements and meteorological data from multiple locations in Beijing.

The second objective of this study is to investigate the effect of incorporating additional geospatial data about nearby pollution sources on the forecasting accuracy of the hybrid spatiotemporal forecasting model. To address this, we will use an air quality dataset from the city of Nablus in Palestine, as provided by Saleh et al. [55].
This dataset includes details on pollution sources categorized by their hazard levels. We will first evaluate the model using only air quality-related features and then integrate the pollution source data to assess its impact on forecasting accuracy at each station. To accommodate the small size of the Nablus dataset, the temporal forecasting component of the hybrid model, represented by the Informer, was replaced with a Seq2Seq LSTM model.

The study aims to enhance spatiotemporal PM2.5 forecasting by developing a model that effectively captures dynamic spatial dependencies and temporal patterns. It also examines how incorporating additional information about nearby pollution sources for each station influences forecasting accuracy.

1.4 Study Hypothesis

We hypothesize that integrating a dynamic variant of the Graph Attention Network (DyGAT), which captures spatial dependencies, with the Informer model, which captures temporal dependencies, will create a hybrid spatiotemporal forecasting model that improves PM2.5 concentration forecasting. Specifically, we expect that this hybrid model will outperform models that use only temporal forecasting without spatial dependency integration. In the case study using local data from Nablus, Palestine, we anticipate that incorporating additional contextual features about nearby pollution sources will enhance forecasting accuracy.

1.5 Importance of the Study

This study is important for several reasons. It addresses a public health issue by enhancing the ability to predict PM2.5 concentrations accurately. Additionally, it contributes to the fields of environmental science and deep learning by exploring novel techniques for PM2.5 forecasting, which can be applied to various regions worldwide.
Finally, it investigates how integrating geospatial and pollution source information into spatiotemporal PM2.5 forecasting models affects forecasting accuracy, providing insights that have implications for air quality research and environmental policies.

Chapter Two
Foundational Concepts

This chapter presents the theoretical and mathematical foundations of the components that form the DyGAT-Informer model. The following sections provide an overview of the key components: Attention Mechanisms, Graph Attention Networks, and Time-series Transformers. These components form the basis of the model's architecture. Each section includes the mathematical equations and theoretical concepts that are needed to understand the model's design and functionality.

2.1 Attention Mechanisms

During the construction of the DyGAT model, we used different attention mechanisms for different tasks within the model. The final version of the DyGAT model uses three different attention mechanisms; therefore, we introduce attention mechanisms in general here, before presenting the specific ones we used in the Methodology chapter.

In the context of deep learning, an attention mechanism is a computational mechanism that enables neural networks to selectively focus on certain parts of the input data while ignoring others, similar to the cognitive attention mechanism in the human brain, which focuses on important elements in the environment while ignoring irrelevant information [56]. Attention mechanisms in neural networks help models weigh the importance of different input elements dynamically. Because they are general-purpose, they are used in many types of deep learning architectures in computer vision, Natural Language Processing (NLP), and other models that work with sequential data [57].
Although attention mechanisms have existed for a long time, their popularity started to rise after 2015, when they were used in many studies on machine translation and image captioning. At that time, attention mechanisms were typically used together with recurrent or convolutional layers in LSTMs and CNNs. However, in 2017 the Transformer [43] was introduced, which relies entirely on attention, in particular self-attention, without recurrent or convolutional layers [58].

Attention mechanisms come in different types, each with its own variations. One model can be built by combining different types of attention techniques. Several surveys [56], [57], [58] presented different taxonomies for attention mechanism types; we summarize them as follows:

• Soft and Hard Attention: Soft attention assigns weights to each element in the sequence, allowing the model to focus on multiple parts simultaneously. Hard attention, in contrast, chooses a single element to attend to at each step, making a more definitive choice. Soft attention is more commonly used in the literature.
• Self-Attention: Computes attention weights within the same sequence, allowing each element to attend to other elements, including itself. It is widely used in sequence modeling tasks like machine translation and sentiment analysis.
• Multi-Head Attention: Employs multiple sets of attention weights to capture different aspects or representations of the input. This enables the model to attend to various parts of the input simultaneously, providing richer context.
• Scaled Dot-Product Attention: A form of self-attention where attention scores are computed by taking the dot product of query and key vectors, followed by scaling and softmax normalization. It is commonly used in Transformer-based models.
• Local Attention: Focuses on a subset of nearby inputs rather than the entire input sequence. This can improve efficiency, particularly for long sequences.
• Global Attention: Considers the entire input sequence when computing attention weights, potentially attending to all elements in the sequence. It is useful for capturing long-range dependencies.

It should be noted that there are other ways to classify attention mechanisms. The taxonomy we presented is a summary of the four surveys we chose. These attention mechanisms are usually used in encoder-decoder architectures built on RNNs and their variants (LSTM and GRU), and in Transformers, which rely specifically on self-attention. They are also used in other architectures like CNNs, Memory Networks, GNNs and hybrid architectures.

Each attention mechanism variation has its own customized mathematical representation. Here we provide a generalized mathematical representation for attention mechanisms. A general attention mechanism computes a context vector $c_i$ for each input element $i$ based on its relevance to the current context. To compute $c_i$ we need a Query matrix (Q) and a Key matrix (K). They can have different names depending on the model's architecture or the attention mechanism type, but generally they are defined as:

• Query: Represents the element of interest for which we want to compute the attention scores. It serves as a reference point or the focus of attention.
• Key: Keys are a representation of the other elements in the input sequence. Each key is compared with the query to determine how relevant it is to the query. Keys help in assessing the importance or relevance of different elements in the input sequence with respect to the query.

To compute the context vector, we first need to compute the attention scores, also known as energy ($e$):

$$e_{ij} = f(q_i, k_j) \tag{2.1}$$

where $f$ is called a score function (also called a compatibility or alignment function). The score function determines the similarity or compatibility between a query and a key, ultimately influencing the attention weights assigned to each key. There are different types of score functions used in attention mechanisms.
Examples of score functions are: Additive, Multiplicative (dot product), Scaled multiplicative, Concat, Location-based, Similarity and Cosine-similarity based score functions [56], [59].

After calculating the attention scores, we calculate the attention weights by applying the softmax function to the attention scores:

$$\alpha_{ij} = \operatorname{softmax}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \tag{2.2}$$

Finally, we compute $c_i$ as follows:

$$c_i = \sum_{j=1}^{n} \alpha_{ij} \cdot v_j \tag{2.3}$$

where $v_j$ denotes the value or feature representation of input element $j$, and $n$ is the total number of input elements.

2.2 Graph Attention Networks

A graph is a data structure used to represent relationships between objects. It is made up of nodes (or vertices) and edges. Nodes are the objects or entities, while edges show the relationships between them. Edges can be either directed, where the relationship has a direction from one node to another, or undirected, where the relationship is bidirectional between two nodes without a specific direction. Graphs can also be weighted or unweighted. In a weighted graph, edges have different weights, which can add another layer of meaning to the connections. Unweighted graphs, on the other hand, treat all edges equally [60].

An adjacency matrix is a way to represent a graph using a grid. It is a square matrix where each cell at position (i, j) indicates whether there is an edge between node i and node j. For a directed graph, the cell value shows the presence and direction of the edge, while for an undirected graph, it just shows the presence of an edge. In a weighted graph, the cell can also contain the weight of the edge. This matrix provides a compact and convenient way to store and work with graph data [60]. Graphs are versatile and can be used in algorithms like Dijkstra’s for finding the shortest paths, and in machine learning models like Graph Neural Networks (GNNs) to handle complex relationships in data.
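As a minimal sketch that ties the adjacency matrix to Equations (2.1) to (2.3), the following Python snippet computes attention weights for one node over only the nodes it shares an edge with, and then forms the context vector. The dot-product score function, the toy feature vectors and the function names are illustrative assumptions; this is not the score function of any specific GNN or of the DyGAT model.

```python
import math


def softmax(scores):
    """Eq. (2.2): turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def neighbor_attention(i, adjacency, features):
    """Attention of node i over its neighbors in the adjacency matrix.

    Assumption: the node's own features act as the query, and each
    neighbor's features act as both key and value.
    """
    neighbors = [j for j, edge in enumerate(adjacency[i]) if edge != 0]
    dim = len(features[i])
    if not neighbors:  # isolated node: nothing to attend to
        return {}, [0.0] * dim

    # Eq. (2.1) with a dot-product score function (an assumed choice of f).
    scores = [sum(q * k for q, k in zip(features[i], features[j]))
              for j in neighbors]
    weights = softmax(scores)

    # Eq. (2.3): weighted sum of the neighbors' value vectors.
    context = [sum(w * features[j][d] for w, j in zip(weights, neighbors))
               for d in range(dim)]
    return dict(zip(neighbors, weights)), context


# Toy undirected graph: node 0 is connected to nodes 1 and 2.
ADJACENCY = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
FEATURES = [[1.0, 0.0], [2.0, 0.0], [2.0, 0.0]]

weights, context = neighbor_attention(0, ADJACENCY, FEATURES)
```

Because the two neighbors of node 0 have identical features, they receive equal weights; changing one neighbor's features shifts the weights toward the neighbor with the higher score.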
Graph Neural Networks [61] are a class of deep learning models specifically designed to work with graph-structured data, such as transportation networks, social networks, and biological data (e.g. genes and proteins). Unlike traditional neural networks, GNNs are designed to capture the relationships and dependencies between connected nodes in a graph, making them particularly effective for tasks where the structure of the data is important.

Graph Attention Network (GAT) [38] was introduced in 2018 as a GNN variant that uses attention mechanisms for learning features on graphs. Some GNN variants, like vanilla GCNs, do not use an attention mechanism, so they aggregate information from a node's neighbors with fixed weights, without learning which neighbors have more influence on that node. In contrast, GATs use attention mechanisms to focus on key features of neighboring nodes, so each neighboring node is assigned a different weight. GATs do not require prior knowledge of the graph structure. The first GAT, presented by Veličković et al. [38], used self-attention for node classification tasks in graph-structured data. They also used multi-head attention with concatenation of the attention heads' outputs to stabilize the learning process.

GATs in general refer to a class of GNNs that utilize attention mechanisms to dynamically weight the importance of neighboring nodes and learn complex, context-dependent relationships within a graph.

There are various types of GATs, with different taxonomies that categorize them based on factors like attention mechanisms, architecture, and application. Several surveys have explored the different types of GATs, providing detailed taxonomies and classifications. For instance, a survey [62] of the attention mechanisms used in GNNs classified the attention mechanism used in the work of Veličković et al. [38] as “learnable attention”, where the attention weights are learned.
The survey also identified two other classes of attention mechanisms in the literature. The first is “Similarity-based attention”, which is also learned but allocates greater attention to objects that share more similar hidden representations or features. The second is “Attention-guided walk”, which uses attention mechanisms to guide the graph traversal process, unlike traditional random walks that traverse the graph uniformly or according to predetermined rules.

Another survey [59] classifies attention mechanisms in GNNs into a two-level taxonomy. The upper level divides attention mechanisms into three types based on their high-level architectural differences: Graph Recurrent Attention Networks (GRANs), Graph Attention Networks (GATs), and Graph Transformers. GRANs focus on integrating RNNs with attention mechanisms for graph data. GATs introduce attention directly into graph nodes, allowing nodes to weigh their neighbors' importance. Graph Transformers leverage Transformer-based architectures for graph data. The lower level of the taxonomy categorizes attention mechanisms in GNNs based on architectural designs within the three main categories.

Finally, a comprehensive review paper [63] categorized Graph Attention Networks (GATs) into six main types:

1. Global Attention Networks, which focus on the overall graph structure.
2. Multi-Layer GATs, which utilize multiple layers for deeper feature extraction.
3. Graph-embedding GATs, which utilize graph-embedding techniques to learn richer and more informative node representations.
4. Spatial GATs, which incorporate spatial information for more accurate modeling.
5. Variational GATs, which incorporate variational inference to effectively model complex, heterogeneous, and multimodal data across various domains.
6. Hybrid GATs, which combine various strategies for enhanced performance.
GATs have become widely utilized across various domains due to their ability to capture complex relationships within graph-structured data. They are particularly effective in node and graph classification, where they classify individual nodes or entire graphs based on their features. GATs excel in link prediction, where they estimate the likelihood of connections between nodes. Additionally, they are utilized in recommendation systems, where they enhance user-item interactions and predict user preferences. GATs are also used in traffic forecasting, where they model road networks to forecast traffic patterns. Moreover, they are used for molecular graph analysis, where they predict molecular properties. In image analysis, GATs improve tasks like segmentation and object detection by capturing spatial relationships. GATs are applied in medical fields for disease prediction and analysis of biological data, as well as in natural language processing, enhancing sentiment analysis by capturing contextual dependencies in text. Finally, GATs are employed in anomaly detection, including fraud detection and network security. These varied applications demonstrate the versatility and effectiveness of GATs in handling graph-based tasks across multiple fields [59][63]. Despite the effectiveness of GATs in handling graph-based data, they face several challenges. These challenges include computational complexity, as their cost increases with larger graphs, particularly due to the attention mechanism's complexity, leading to scalability issues. GATs can also suffer from over-smoothing in deep architectures, where node features become indistinguishable across layers, and they may struggle with capturing long-range dependencies in large graphs. Additionally, overfitting is a concern, particularly when data is limited or noisy. Moreover, interpretability remains a challenge, as understanding the reasoning behind attention weights can be difficult. 
Other limitations include high memory consumption due to attention weight storage and vulnerability to noisy data, which can reduce the robustness of GATs' performance [59][62][63].

2.3 Time-series Transformers

The introduction of Transformers [43] has revolutionized various fields, including natural language processing and image recognition. Their self-attention mechanism enables them to capture long-range dependencies within sequences, and they have proven to be highly effective in modeling complex relationships [64]. The basic architecture comprises the following components:

• Encoder: The encoder consists of multiple identical layers (blocks), where each one contains a multi-head self-attention mechanism and a position-wise feed-forward neural network.
• Decoder: The decoder also comprises multiple identical layers. However, in addition to the self-attention and feed-forward network present in the encoder blocks, each decoder block incorporates cross-attention, which is used to attend to the encoder's output.
• Self-Attention Mechanism: The self-attention mechanism allows a token in the input sequence to attend to all other tokens in the same sequence. This mechanism enables the model to capture long-range dependencies efficiently.
• Positional Encoding: Transformers do not inherently understand the sequential order of tokens, which is why positional encoding is usually used to add positional information to the input embeddings. This allows the model to differentiate between the positions of tokens in the sequence.

The Transformer's ability to work with long sequences made it a desirable candidate for building time-series forecasting models. In the past few years, many time-series Transformer variants have emerged. These models presented different modifications to the vanilla Transformer to make it suitable for time-series forecasting. These modifications were made at different architectural levels.
One survey [64] divided the modifications made to the vanilla Transformer in time-series Transformers into two main categories: modifications to components of the existing architecture, and new architectural innovations.

The first modification to the existing components is the introduction of new positional encoding methods like “Learnable Positional Encoding”, where the model learns the positional encoding from the input sequence. Another positional encoding technique is “Time-stamp Encoding”, where time-related features (hour, day, year, holidays, etc.) are used as a positional encoding method. The second type of modification to the original components is modification of the attention module. The original Transformer has a memory and time complexity of $O(N^2)$, where $N$ is the length of the sequence, which poses a computational bottleneck when dealing with long sequences. Therefore, some time-series Transformers added sparsity to the attention mechanism to reduce the memory and time complexity. Examples of these Transformers are Informer [46], LogTrans [65], Pyraformer [66] and many others.

An example of architectural innovation is presented by the Informer, which incorporates max-pooling layers between attention blocks. Pyraformer, on the other hand, utilizes a C-ary tree-based attention mechanism.

Other time-series forecasting Transformers, like Autoformer [67] and FEDformer [68], used signal decomposition to enhance the model’s forecasting abilities. In addition, they presented novel attention mechanisms: Autoformer introduced an auto-correlation mechanism that analyzes the data's periodicity to identify and aggregate similar sub-series, which enables the model to capture dependencies within the data more efficiently. FEDformer was built on Autoformer, but it introduced its own Frequency Enhanced Attention (FEA) mechanism.
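To make the time-stamp encoding idea from this section concrete, the sketch below maps a timestamp to a small vector of calendar features that a model could add to its input embeddings as positional information. The chosen features, the scaling to the range [-0.5, 0.5] and the function name are illustrative assumptions, not a description of how any particular time-series Transformer implements it.

```python
from datetime import datetime


def timestamp_features(ts):
    """Map a timestamp to calendar features scaled to [-0.5, 0.5].

    Assumed feature set: hour of day, day of week, day of month and
    day of year. Holiday flags could be appended in the same way.
    """
    return [
        ts.hour / 23.0 - 0.5,                        # hour of day
        ts.weekday() / 6.0 - 0.5,                    # day of week (Monday = 0)
        (ts.day - 1) / 30.0 - 0.5,                   # day of month
        (ts.timetuple().tm_yday - 1) / 365.0 - 0.5,  # day of year
    ]


# First hourly record of a series that starts on March 1st, 2013.
features = timestamp_features(datetime(2013, 3, 1, 0))
```

Unlike a learned or sinusoidal encoding that only reflects the index of a step in the window, these features carry the same value for the same calendar position, so two windows taken from different years can share information about daily or seasonal cycles.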
Chapter Three
Methodology

In this chapter, we present a thorough description of the datasets used in the study and provide insights acquired during the exploratory data analysis. In addition, we explain the pre-processing techniques used to prepare the data for training and testing. Moreover, we provide a detailed breakdown of the architecture of the proposed model.

3.1 Data Description and Preprocessing

In this study, the Beijing Multi-Site Air-Quality Dataset [54] is the main dataset used to train and test the proposed model. The dataset contains a collection of spatiotemporal air quality measurements and meteorological data across 12 monitoring stations in Beijing. The Beijing air quality data was collected from March 1st, 2013, to February 28th, 2017.

The second dataset is an air quality dataset collected from 8 measuring stations in the city of Nablus in Palestine. The data was collected from January 6th, 2022, to March 3rd, 2022. It includes meteorological data for the city of Nablus, obtained from the 'Time and Date' weather website, covering the same period as the air quality dataset. This dataset was collected as a part of a study by Saleh et al. [55], where they presented a methodology for selecting air quality monitoring locations based on low-cost sensors and Geographic Information Systems (GIS) [69].

The distances between the measuring stations in the Nablus air quality dataset are small; the longest distance between a pair of nodes is 10.93 km. Therefore, the weather data is the same for the whole city of Nablus, and the only differences between the nodes’ features are the geolocation (longitude and latitude) and the PM2.5 readings. However, we chose this dataset to study another factor that is usually missing in other air quality datasets, which is the information about nearby pollution sources. A general description of both datasets is presented in Table (1).
One notable difference between the two datasets is the length of the time-step, where the Beijing dataset has a time-step of 1 hour while the Nablus dataset has a time-step of one minute for PM2.5 and 30 minutes for meteorological data.

Table 1
Datasets Properties

| Properties | Beijing | Nablus |
|---|---|---|
| Number of Stations | 12 | 8 |
| Number of Features | 16 | 11 |
| Number of Records | 420768 | PM2.5: 542607; Meteorological Data: 2531 |
| Timestamp Interval | 1 Hour | PM2.5: 1 Minute; Meteorological Data: 30 Minutes |
| Time Span | March 1st, 2013, to February 28th, 2017 | January 6th, 2022, to March 3rd, 2022 |

Although the Nablus dataset contains a high number of records due to the PM2.5 data being recorded every minute, the final number of records per station is 2692, and the total number of records across all stations is 21536. This reduction is due to the meteorological data being recorded at a 30-minute interval. For the Beijing dataset, there are 35064 records per station, and the total number of records is 420768.

3.1.1 Data Analysis

The Beijing dataset consists of measurements from 12 stations. To perform data analysis, the data from all 12 stations were combined. The raw dataset contained 18 features, including 16 numerical features and two categorical features. Out of the 16 numerical features, four features represent temporal information (year, month, day, hour), and one feature is the ordinal number of the timestamp. The remaining 11 numerical features are related to meteorological and air pollutant features. The two categorical features are “station name” and “wind direction”. Two additional features, longitude and latitude, were added for each station. Wind direction is a categorical feature with 16 categories, where for example, ‘W’ means west, ‘SW’ means southwest, and ‘WSW’ means west-southwest.

Table (2) presents a statistical summary of the relevant numerical features. The summary includes count, mean, minimum, maximum, standard deviation and Coefficient of Variation (CV) for each feature.
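The Coefficient of Variation reported in the statistical summaries is the standard deviation expressed as a percentage of the mean. The short Python sketch below shows the calculation, both from raw values and from already summarized statistics. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean.

    Assumption: the sample (n - 1) standard deviation is used.
    The CV is undefined when the mean is zero.
    """
    mean = statistics.fmean(values)
    return statistics.stdev(values) / mean * 100.0


def cv_from_summary(mean, std):
    """CV in percent from an already computed mean and standard deviation."""
    return std / mean * 100.0


# Atmospheric pressure (PRES) in the Beijing dataset: mean 1010.75, std 10.47.
pres_cv = cv_from_summary(mean=1010.75, std=10.47)
```

For the pressure values above, the result rounds to 1.04%, which matches the CV reported for PRES. Because the CV divides by the mean, features whose mean is close to zero (such as rain or dew point) produce very large values even when their absolute spread is small.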
The statistical summary provides valuable insights into the distribution and range of these variables. For example, Atmospheric Pressure (PRES) demonstrates low variability in the data points, with a CV of 1.04%. Features such as PM10, NO2, TEMP, and WSPM exhibit moderate variability, suggesting a noticeable but not extreme spread in their data values. High variability is observed in SO2, CO, O3, and PM2.5, whose high CVs indicate a broad range of data points. Notably, DEWP and RAIN show extremely high variability, with CVs that suggest substantial fluctuations relative to their means.

Table 2
Statistical Summary for Numerical Features in Beijing Dataset

| Feature | Description | count | mean | min | max | std | CV |
|---|---|---|---|---|---|---|---|
| PM10 | PM10 concentration in µg/m³ | 415037 | 104.57 | 2.00 | 999.00 | 91.70 | 87.67% |
| SO2 | Sulfur dioxide concentration in µg/m³ | 412682 | 15.82 | 0.29 | 500.00 | 21.63 | 136.74% |
| NO2 | Nitrogen dioxide concentration in µg/m³ | 409675 | 50.64 | 1.03 | 290.00 | 35.08 | 69.25% |
| CO | Carbon monoxide concentration in µg/m³ | 401843 | 1229.30 | 100.00 | 10000 | 1157.82 | 94.21% |
| O3 | Ozone concentration in µg/m³ | 409210 | 57.31 | 0.21 | 1071.00 | 56.55 | 98.67% |
| TEMP | Temperature in degrees Celsius | 420390 | 13.54 | -19.90 | 41.60 | 11.44 | 84.50% |
| PRES | Atmospheric pressure in hPa | 420395 | 1010.75 | 982.40 | 1042.80 | 10.47 | 1.04% |
| DEWP | Dew point temperature in degrees Celsius | 420385 | 2.49 | -43.40 | 29.10 | 13.79 | 553.01% |
| RAIN | Rain precipitation in mm | 420398 | 0.06 | 0.00 | 72.50 | 0.82 | 1366.67% |
| WSPM | Wind speed in m/s | 420464 | 1.73 | 0.00 | 13.20 | 1.25 | 72.25% |
| PM2.5 | PM2.5 concentration in µg/m³ | 412954 | 79.74 | 2.00 | 999.00 | 80.74 | 101.25% |

To better understand the temporal variations in PM2.5 concentrations, Figure (1) presents a bar chart depicting the average PM2.5 concentration across different months from 2013 to 2017. Each bar represents the average PM2.5 concentration for a particular month. The average PM2.5 value for the same month varies considerably from year to year.
A clear example is February, which had the highest average PM2.5 concentration of the year in 2014, but the lowest average concentration of the year in 2016.

Figure 1
Average PM2.5 Concentration by Month for Each Year for Beijing Dataset

Figure (2) shows boxplots for the PM2.5 values at each station for the Beijing dataset. The boxplots visualize the distribution of the data. They offer insights into our data, including the 25th percentile (first quartile, Q1), the median (Q2), and the 75th percentile (third quartile, Q3). They also highlight the minimum and maximum values, as well as the presence of outliers. Each horizontal boxplot represents a specific station, and the vertical axis displays PM2.5 values. The box for each station provides a summary of the data distribution. The elements of the boxplot are as follows [70]:

• Box: It represents the Interquartile Range (IQR), which extends from the first quartile (Q1) to the third quartile (Q3) and covers the middle 50% of the data. The line within the box indicates the median (Q2) PM2.5 value.
• Whiskers: The lines extending from the box indicate the range of the data that lies within 1.5 times the IQR beyond the edges of the box.
• Outliers: Individual data points plotted beyond the whiskers are considered outliers.

Figure 2
The Boxplots of the PM2.5 Values at Each Station in the Beijing Dataset.

Most stations show similar distributions in their PM2.5 values, with many having their median values close to 50. There are many outliers, especially at higher PM2.5 values, which indicates spikes in pollution levels above the typical range for each station. Some stations have wider IQRs and more outliers, indicating greater variability in PM2.5 values. The Wanshouxigong station has the highest outlier value (close to 1000), suggesting an episode of severe pollution.
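The boxplot elements described above can be computed directly from the data. The following Python sketch derives the quartiles, the IQR, the whisker ends and the outliers for one station. The function name and the sample readings are hypothetical, and the quartile interpolation method ("inclusive") is an assumption, since plotting libraries differ in how they interpolate quartiles.

```python
import statistics


def boxplot_summary(values, whisker=1.5):
    """Quartiles, IQR, whisker ends and outliers for one set of readings."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr

    # Whiskers end at the most extreme points that are still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical readings with one pollution spike.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

In this toy example the IQR is 4.5, so the upper fence is 7.75 + 1.5 × 4.5 = 14.5; the reading of 100 lies beyond it and is reported as an outlier, while the upper whisker ends at 9.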
We have decided to keep the outliers in the data for two reasons. First, in the context of environmental data like PM2.5 levels, outliers often represent real and extreme events, such as pollution spikes due to weather conditions, industrial activities, or traffic. Including them allows our model to learn and adapt to these extreme events, providing a more realistic and comprehensive understanding of air quality dynamics. This leads to a more robust model that can provide better forecasts. Second, air quality forecasting models are often used in scenarios where predicting extreme events is crucial, which makes including outliers in the training data essential, especially when the forecasting model is used in early warning systems.

To ensure that these extreme values are not measurement errors or noise, we examined the extreme PM2.5 levels across all stations. We observed that these extreme values occurred consistently across all measuring stations, making the likelihood of an instrument error very low. Additionally, the extreme values remain within the possible range for PM2.5 concentration levels.

Figure (3) shows the correlation matrix for the Beijing dataset, which provides a visual representation of the relationships between the numerical features in the dataset.

Figure 3
The Correlation Matrix of the Numerical Features in the Beijing Dataset.

The matrix highlights both positive and negative correlation coefficients (r), with darker colors indicating stronger positive relationships. PM2.5 shows a strong positive correlation with PM10 (r = 0.88), CO (r = 0.77), and SO2 (r = 0.66), which suggests that the levels of these pollutants are related. One possible reason for the high positive correlation between these pollutants is that they share similar sources, such as vehicle emissions and industrial activities.
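For reference, each cell of a correlation matrix like the one discussed above can be obtained with the Pearson correlation coefficient. The sketch below is a minimal illustration that assumes Pearson correlation was the measure used; the function names and the toy columns are hypothetical and are not values from the Beijing dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson r for a dict that maps feature name -> list of values."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names}
            for a in names}


# Toy columns: "b" rises with "a", while "c" falls as "a" rises.
TOY = {"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0], "c": [3.0, 2.0, 1.0]}
matrix = correlation_matrix(TOY)
```

Values close to 1 indicate that two features rise together, values close to -1 indicate that one falls as the other rises, and values near 0 indicate no linear relationship.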
We have decided to include all of the features when training the hybrid spatiotemporal PM2.5 forecasting model to fully leverage the comprehensive nature of the dataset. Each feature, such as the various air pollutants (PM2.5, PM10, NO2, CO, O3, SO2), the meteorological data (temperature, pressure, dew point, wind speed, wind direction, precipitation), and the temporal information, offers unique insights into the complex atmospheric processes affecting PM2.5 levels. The DyGAT-Informer hybrid model is well-equipped to handle feature interactions, even those involving highly correlated variables. GAT’s attention mechanism allows it to focus on the most relevant relationships between features, while Informer’s self-attention mechanism identifies and prioritizes the temporal dependencies that matter most. By utilizing the full feature set, the model can capture both direct and indirect interactions among features, thereby enhancing its ability to forecast PM2.5 levels with greater accuracy.

Regarding the Nablus dataset, as stated before, the dataset was created by combining the PM2.5 readings from 8 locations in the city of Nablus with weather data retrieved from the weather website. The Nablus dataset contains 11 features, including 8 numerical features, 2 categorical features, and 1 temporal feature. The numerical features are latitude and longitude (spatial coordinates), temperature, humidity, atmospheric pressure, visibility distance, rain and PM2.5. The categorical features are station name and ‘Weather’. Finally, the temporal feature represents the timestamp of each measurement. The categorical feature ‘Weather’ is a description of the weather in phrases separated by periods (e.g. “Light rain. Partly cloudy.”). The ‘Rain’ feature is binary, with a value of zero indicating no rain and one indicating rain. The textual description of the amount of rain is stated in the ‘Weather’ feature.

Table (3) presents a statistical summary of the numerical features.
There seem to be some errors in the measurements, because the maximum PM2.5 value (17188.39 µg/m³) is not reasonable, and the minimum value (-1) is not possible.

Table 3
Statistical Summary for Numerical Features in Nablus Dataset

| Feature | Description | count | mean | std | min | max | CV |
|---|---|---|---|---|---|---|---|
| Temp | Temperature in Fahrenheit | 21536 | 54.81 | 6.291 | 37 | 81 | 11.48% |
| Wind speed | Wind speed in mph | 21536 | 7.44 | 4.90 | 0 | 37 | 65.8% |
| Humidity | Humidity as a percentage | 21536 | 0.71 | 0.16 | 0.18 | 1 | 23.1% |
| Barometer | Atmospheric pressure in inHg | 21536 | 30.04 | 0.12 | 29.74 | 30.39 | 0.42% |
| Rain | Binary (raining or not) | 21536 | 0.41 | 1.14 | 0 | 1 | 278.4% |
| PM2.5 | PM2.5 concentration in µg/m³ | 21536 | 22.85 | 137.64 | -1 | 17188.39 | 601.6% |

The CV analysis reveals that the ‘Barometer’ feature has low variability in its data points, while Temp, Wind speed, and Humidity show moderate variability. Finally, PM2.5 has high variability, indicating substantial fluctuations.

For the Nablus dataset, we had to deal with some extreme outliers that do not represent real pollution levels but rather measurement errors. For example, at the “Unit_F_Hijjawi” station, the maximum PM2.5 value was 17188.39 µg/m³, which is highly improbable. In addition, the minimum value at the “118_NNUH” station was -1, which is not possible because PM2.5 concentrations cannot be negative.

3.1.2 Data Preprocessing

The first step in preprocessing the data is examining the number of missing values for each feature and filling them with an imputation method. For the Beijing dataset, the missing values for meteorological features were less than 0.1%. However, for pollutant features, the missing values were mostly around 2%, except for CO, which had 4.92%, and O3, which had 3.16%. For the Nablus dataset, the only two features with missing values were wind speed, with 0.04%, and visibility, which had a much higher percentage of missing values at 44.76%.
Since these are time-series data, simple imputation methods like the mean and median may not be suitable, because a value at one time-step is related to the values around it more than to values at distant time-steps. For the numerical features, an Iterative Imputation method was used, which applies a multivariate imputation algorithm to estimate missing values iteratively. It treats each feature column with missing values as a target variable and uses the other columns as predictors. At each iteration, the Iterative Imputer uses a regression model to predict the missing values. For the categorical features like wind direction, a k-nearest neighbors Imputer was used, which imputes missing values based on the values of neighboring data points and can handle categorical data.

In the Beijing dataset, we encoded the wind direction by converting the textual description of the direction into angular degrees, so we ended up with 16 different degree values to describe the wind direction. Figure (4) shows the wind direction mapping from textual categories to degrees.

Figure 4
Wind Direction Degree Mapping Chart.

The PM2.5 measurements in the Nablus dataset had a temporal granularity of 1 minute, while the meteorological measurements had a temporal granularity of 30 minutes. We aggregated the PM2.5 measurements into 30-minute intervals to match the meteorological data. As stated in Section 3.1.1 (Data Analysis), the Nablus dataset had some outliers due to measurement errors, so we had to fix them. The negative PM2.5 value of -1 was changed to zero. Moreover, the extremely high value was treated as missing and filled during the imputation process.
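The wind direction encoding used for the Beijing dataset can be sketched as a lookup from the 16 compass labels to degrees. The mapping below assumes the common compass convention in which north is 0 degrees and the angle increases clockwise in steps of 22.5 degrees; the exact mapping in Figure (4) may differ, and the names used here are hypothetical.

```python
# The 16 compass points in clockwise order, starting from north.
COMPASS_POINTS = [
    "N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
    "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW",
]

DEGREES_PER_POINT = 360.0 / len(COMPASS_POINTS)  # 22.5 degrees per category

# Assumed convention: N = 0 degrees, increasing clockwise.
WIND_DIRECTION_DEGREES = {
    label: index * DEGREES_PER_POINT
    for index, label in enumerate(COMPASS_POINTS)
}


def encode_wind_direction(label):
    """Convert a textual wind direction such as 'WSW' into degrees."""
    return WIND_DIRECTION_DEGREES[label.strip().upper()]


west_southwest = encode_wind_direction("WSW")
```

Encoding the categories as angles keeps neighboring directions numerically close, which a one-hot encoding would not do, but it also means that differences between two directions have to account for the wrap-around at 360 degrees.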
The ‘Visibility’ feature (visibility distance) was removed because approximately 45% of its values were missing and the majority of the non-missing entries had a value of 10 miles, offering little variability.

We applied min-max normalization to the data to ensure uniform scaling of the features and improve model performance, especially when the features have different ranges. Min-max normalization transforms each feature to a common scale by mapping its values to the range between 0 and 1 using this formula:

$$x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{3.1}$$

where min(x) and max(x) are the minimum and maximum values of the feature, respectively. Other normalization methods were explored during the initial experiments, and min-max normalization yielded the best results in terms of evaluation metrics.

3.1.3 Features Engineering

In the previous section, we addressed the data preprocessing techniques that were used, such as filling the missing values, normalization, and encoding of categorical data. In this section, we present the feature engineering techniques we used on both datasets. Feature engineering involves creating new features or modifying existing ones to improve the performance of machine learning models.

Starting with the Beijing dataset, we engineered edge features for each pair of nodes to be used in DyGAT alongside the node features. This approach provides an additional perspective on the dynamic structure of the graph. It incorporates various edge features, including directional ones that influence the dispersion of PM2.5. These edge features encapsulate the spatiotemporal relationships between monitoring sites by leveraging the geographical coordinates of the nodes and the wind features at both the source and target nodes.
First, we calculated the geographical distance between each pair of monitoring stations using the Haversine formula, which takes the curvature of the Earth's surface into account for more precise distance calculations. We then added the wind speeds at both the source and target nodes, as well as the difference in wind direction between the source and target nodes. The Haversine formula is:

$$a = \sin^2\left(\frac{\Delta\phi}{2}\right) + \cos(\phi_1) \cdot \cos(\phi_2) \cdot \sin^2\left(\frac{\Delta\lambda}{2}\right) \tag{3.2}$$

$$c = 2 \cdot \operatorname{atan2}\left(\sqrt{a}, \sqrt{1 - a}\right) \tag{3.3}$$

$$d = R \cdot c \tag{3.4}$$

where:
• ϕ1 and ϕ2 are the latitudes of the two points in radians.
• ∆ϕ is the difference in latitudes (ϕ2 − ϕ1).
• ∆λ is the difference in longitudes (λ2 − λ1).
• R is the Earth's radius (mean radius = 6,371 km).
• d is the distance between the two points.

We wanted to capture the influence of wind direction on pollutant dispersion. To do this, we used the geo-locations of each pair of nodes to calculate the initial bearing from the source node (point 1) to the target node (point 2). Then, we compared the wind direction at the source node with this bearing to determine whether the wind flow from the source node was directed towards the target node. If the wind direction aligns with the calculated bearing within a defined threshold of 45 degrees, it suggests that the wind is blowing towards the target location. The following equations are used to determine whether the wind from the source node blows towards the target node:

$$y = \sin(\Delta\lambda) \cdot \cos(\phi_2) \tag{3.5}$$

$$x = \cos(\phi_1) \cdot \sin(\phi_2) - \sin(\phi_1) \cdot \cos(\phi_2) \cdot \cos(\Delta\lambda) \tag{3.6}$$

$$\theta = \operatorname{atan2}(y, x) \tag{3.7}$$

$$\text{Initial Bearing} = \left(\left(\theta \times \frac{180}{\pi}\right) + 360\right) \bmod 360 \tag{3.8}$$

$$\Delta_{wind} = \left|\text{Wind Direction (source)} - \text{Initial Bearing}\right| \tag{3.9}$$

Then Δwind is adjusted for the circular nature of wind directions:

$$\Delta_{wind} = \begin{cases} \Delta_{wind} & \text{if } \Delta_{wind} \le 180 \\ 360 - \Delta_{wind} & \text{if } \Delta_{wind} > 180 \end{cases} \tag{3.10}$$

$$\text{Wind Blows Towards Target} = \begin{cases} 1 & \text{if } \Delta_{wind} \le 45 \\ 0 & \text{if } \Delta_{wind} > 45 \end{cases} \tag{3.11}$$

where:
• ϕ1 and ϕ2 are the latitudes of the source and target points in radians.
• ∆ϕ is the difference in latitudes (ϕ2 − ϕ1).
• ∆λ is the difference in longitudes (λ2 − λ1).
• θ is the initial bearing angle in radians.
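The distance and bearing computations above can be sketched in a few lines of Python. This is a minimal illustration rather than the thesis code: the function names are invented here, the bearing uses the standard initial-bearing form with x = cos ϕ1 sin ϕ2 − sin ϕ1 cos ϕ2 cos ∆λ and ∆λ = λ2 − λ1 (source to target), and the wind direction is assumed to be expressed as the direction the wind travels towards. If the data report the direction the wind comes from, as is common in meteorology, 180 degrees would need to be added first.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, as in Eq. (3.4)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_KM * c


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 (source) to point 2 (target), in [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlam)
    theta = math.atan2(y, x)
    return (math.degrees(theta) + 360) % 360


def wind_blows_towards_target(wind_dir_deg, bearing_deg, threshold_deg=45.0):
    """Binary edge feature: 1 if the wind direction is within the threshold of the bearing."""
    delta = abs(wind_dir_deg - bearing_deg) % 360
    if delta > 180:
        delta = 360 - delta  # wrap around the circle, as in Eq. (3.10)
    return 1 if delta <= threshold_deg else 0


# Illustrative coordinates only (not actual station locations).
distance = haversine_km(0.0, 0.0, 0.0, 1.0)
bearing = initial_bearing_deg(0.0, 0.0, 0.0, 1.0)
flag = wind_blows_towards_target(80.0, bearing)
```

The 45-degree default mirrors the threshold stated in the text; exposing it as a parameter simply makes it easy to test other tolerances.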
In summary, we have five features for each edge:

1- Distance: The Haversine distance between the two nodes.
2- Wind direction difference: The difference in wind directions, measured in degrees, between the two nodes. This scalar value represents the angular difference between the wind directions at the two locations and provides additional context about the dynamic spatial relationship between a pair of nodes.
3- Wind blows towards target: A binary feature, where 1 means the wind from the source node is directed towards the target node and 0 means it is not.
4- Wind speed at source node.
5- Wind speed at target node.

As for the Nablus dataset, the weather features were the same for all stations because the stations are very close to each other and located within a relatively small city. The distance between each pair of nodes was therefore the only edge feature, and it is static, unlike the time-varying edge features in the Beijing dataset.

After processing the Beijing dataset, we turned our attention to the Nablus dataset, which contains a different set of node features. One of the features in the Nablus dataset is ‘Rain’, a binary value indicating whether it is raining or not. In addition, there is another feature called ‘Weather’, which contains a textual description of the weather, such as the sky condition (e.g. clear, cloudy, scattered clouds, etc.) and the rain intensity. We therefore extracted the rain intensity description from the ‘Weather’ feature and added it to the ‘Rain’ feature. Both features were then encoded using an ordinal encoder. The ‘Rain’ feature had 6 ordinal categories ranging from ‘No Rain’ to ‘Heavy rain’. The ‘Weather’ feature, which now contains only a categorical description of the sky, was converted to an ordinal description with 7 categories representing cloud coverage.
We made two versions of the Nablus dataset: one with PM2.5 values and meteorological features, and another with added features from the study by Saleh et al. [55]. That study used three main categories of potential air pollution sources, namely factories, quarries, and traffic, as criteria to create a PM hazard map for Nablus. Other factors influencing PM distribution, such as altitude, wind speed, and wind direction, were also considered. The distance and direction (in degrees) from each pollution source to the measurement stations were then calculated. The distance and direction of the pollution hazard sources were added to the second version of the data to test the influence of this added context about nearby pollution sources on the model’s performance.

3.2 Model Architecture

In this section, we present our hybrid spatiotemporal forecasting model, the DyGAT-Informer, which combines two components: the novel Dynamic GAT (DyGAT) model, which we designed to capture dynamic spatial dependencies between measuring stations at different time steps, and the Informer Transformer, which captures the temporal dependencies at each station. The Informer [71] was chosen for its ability to model complex and long-range temporal information.

Before describing our primary hybrid model, it should be noted that for the case study dataset, the ‘Nablus Dataset’, we combined DyGAT with a Seq2Seq LSTM to create DyGAT-LSTM. This model initially served as one of the baseline models for comparing the performance of DyGAT-Informer, trained with the Beijing dataset, against models that used other types of temporal components. This choice was made to accommodate the small size of the Nablus dataset, which was insufficient to train the Informer. In the case study, DyGAT-Informer was included among the baseline models.
We made this adjustment because the main goal of using the case study data, despite its small size, was to test the second hypothesis of the study: that including contextual information about nearby pollution sources for each station would improve forecasting accuracy. The DyGAT-LSTM model is described in detail in section [Baseline Models] in chapter 4 [Experimental Design and Setup].

When we started building our primary hybrid model, several candidate time-series Transformers were considered, including Informer, Autoformer, and FEDformer. Each of these Transformers is a powerful forecasting model, but each has its own strengths and weaknesses depending on the task and the nature of the dataset. During the preliminary experimentation and hyper-parameter tuning phase, we found that the Informer was the most suitable model for our dataset. We included the Autoformer among the final baseline models to compare the performance of the two Transformers. We decided to exclude the FEDformer after the preliminary experiments, as its results were similar to those of the Autoformer while its computation time was significantly higher: each epoch averaged 9 minutes and 30 seconds, compared to just 13 seconds for the Informer and 16 seconds for the Autoformer.

Regarding the modeling of spatial dependencies, we hypothesized that the relation between each pair of nodes is not static and is not determined only by the geo-location of the nodes, but is rather a dynamic relation that changes with the nonlinear processes of air dynamics. To model the dynamic spatial relations between the nodes, we chose GAT because it allows us to include different attention mechanisms that assign varying weights to the nodes and edges in the graph. As mentioned in the feature engineering section, we created edge features to help the GAT model capture the dynamic spatial relation between each pair of nodes.
The edges are directed, meaning that the edge features between a pair of nodes differ depending on which node is the source and which is the target. This is because one of the features indicates whether the wind from the source node is directed towards the target node. The edge features also contain the wind speeds and the difference in wind direction between the nodes, which are time-varying features. Wind direction and speed affect the dispersion of PM2.5 in the atmosphere, which is why we chose them as edge features to emphasize their role in forecasting future PM2.5 concentrations.

Figure (5) presents a general overview of the DyGAT-Informer model. The ‘Graph Module’ represents our DyGAT model and the ‘Transformer Module’ represents the Informer. There are multiple ways to configure a hybrid model, and each configuration has its own advantages and disadvantages depending on the specific goals and requirements of the forecasting task. We experimented with both sequential and parallel configurations to integrate the DyGAT and Informer components of the hybrid model. We found that the optimal configuration was sequential fusion, where the DyGAT processes the data first and its output is then fed into the Informer. Additionally, we examined whether to feed the DyGAT output directly into the Informer or to concatenate it with the original node embeddings, and we found that concatenation produced better results. In the figure, the concatenation process is depicted by the ‘CAT’ block.

Figure 5 Overall Architecture of the DyGAT-Informer Model.

Two historical sequences are used as input to the first stage of the model: the node features and the edge features.
Assuming that p is the number of past time-steps in the historical input sequence, and that Xn and Xe are vectors representing the node features and the edge features respectively, the input sequence for the node features is $X_n = \{X_n^{t-p+1}, X_n^{t-p+2}, \ldots, X_n^{t}\}$, and the input sequence for the edge features is $X_e = \{X_e^{t-p+1}, X_e^{t-p+2}, \ldots, X_e^{t}\}$. Each of these inputs is combined with the temporal features of that sequence (minute, hour, day, week day and year) and used to create the initial node and edge embeddings, which utilize temporal positional encoding to help the Informer understand the positions of the tokens in the sequence. In Figure (5), Xn is the ‘Encoder Input’ and the node embeddings are the ‘Encoder Embeddings’.

Regarding the ‘Decoder Input’, during training the input Yn is the future (target) sequence the model needs to forecast, where $Y_n = \{X_n^{t+1}, X_n^{t+2}, \ldots, X_n^{t+H}\}$, with H representing the forecasting horizon. This input is combined with temporal features and used to generate the ‘Decoder Embeddings’. For inference, the decoder relies on previously predicted values, while still incorporating temporal embeddings to preserve temporal context. The decoder employs attention mechanisms that include masking, to ensure it attends to relevant parts of the sequence and to prevent future information from being accessed during training. In addition to the ‘Decoder Embeddings’, the Informer decoder also receives the encoded representations from the Informer encoder as another input.

In our model, we converted the input time-series features into embeddings that contain two parts:

1. Token Embeddings: The raw input data is converted into a dense, continuous vector representation. Token embeddings represent individual data points in a high-dimensional space, in a form that the Transformer architecture can process effectively.
2. Temporal Embeddings: These represent temporal features (hour, day, month, and year) that can help the model understand periodic patterns and time-based dependencies. They are added to the token embeddings.

The Informer Transformer code [72] provides three types of temporal embeddings, selectable as a hyper-parameter: ‘timeF’, ‘fixed’, and ‘learned’. They are used to embed temporal features of the data (like hour, day, week, month, etc.) into a high-dimensional space. Each type is described below:

1. Fixed Embedding: It creates non-trainable embeddings based on trigonometric functions (sine and cosine). These embeddings depend on the position of the time steps and are generated using a predefined formula without any learnable parameters. This method captures the periodicity of time, which helps in modeling temporal patterns in the data.
2. Learned Embedding: The temporal information is embedded into a vector that can be learned during training. In this method, the temporal features (e.g., hour, day, month) are passed through an embedding layer, and the model learns the best way to represent these features during the training process.
3. Time Feature (timeF) Embedding: It takes raw time-related features (like "hour", "day of the week", etc.) and feeds them directly into a linear layer, which embeds them into a higher-dimensional space. This approach does not learn a specific embedding for each temporal position but instead linearly transforms the temporal features into a vector.

Our model uses ‘fixed’ temporal embeddings, as they yielded better results during hyper-parameter tuning.

3.2.1 DyGAT

We reviewed different GAT models that have been used in spatiotemporal forecasting tasks, and we found a spatiotemporal traffic-forecasting model called STGAT [73].
Their implementation of the “Graph Attention Layer” in the model was the same as in the original GAT [38]. Their “Graph Attention Layer” has a self-attention mechanism and a learnable adjacency matrix, which they called a “self-adaptive adjacency matrix”. It is a parameterized module with learnable parameters. An element adj(i, j) of the adjacency matrix specifies the weight between node i and node j. During training, the adjacency matrix is adjusted iteratively to adapt to the data. We used their Graph Attention Layer as the base on which to build our DyGAT model. As for the adjacency matrix, we created our own “Attention-based Dynamic Adjacency Matrix”, which uses attention mechanisms to create a weighted adjacency matrix for the graph. We kept the original self-adaptive adjacency matrix as a hyper-parameter option. During the hyper-parameter tuning process, the model showed better performance when our attention-based adjacency matrix was used.

The DyGAT model consists of a stack of dynamic GAT layers and an attention-based dynamic adjacency matrix. Each DyGAT layer has a self-attention mechanism called ‘node attention’ that captures the dynamic spatial relations between nodes based on their own features, using the attention-based dynamic adjacency matrix. Multi-head attention is used in the node attention computations. The layer also contains ‘edge attention’ weights computed from the dynamic edge features; these attention weights are used to enhance the representation of the spatial dependencies between the nodes. Figure (6.a) shows a general view of the DyGAT model, and Figure (6.b) shows the structure of the dynamic GAT layer.

Figure 6 DyGAT Architecture. (a) Overall Architecture of the DyGAT. (b) Detailed Architecture of a Dynamic GAT Layer.

There are two inputs to the DyGAT model: ‘Encoder Embeddings’ and ‘Edge Embeddings’.
The ‘Encoder Embeddings’ tensor is of shape [batch_size, n_nodes, n_time_steps, feature_size], where:
• batch_size refers to the number of samples processed in a single training or inference step.
• n_nodes corresponds to the number of nodes in the graph.
• n_time_steps represents the number of time steps in the input sequence.
• feature_size represents the dimensionality of the feature vector, capturing both node attributes and temporal features transformed into an embedding space.

The ‘Edge Embeddings’ tensor is of shape [batch_size, n_nodes, n_nodes, n_time_steps, feature_size], where the two n_nodes dimensions indicate the source and target nodes, encoding their relationships over time.

At each time step, the Encoder Embeddings are extracted and represented as a tensor of shape [batch_size, n_nodes, feature_size]. This tensor is first used as input to the Attention-based Dynamic Adjacency Matrix, which computes a weighted adjacency matrix for that specific time step. The resulting weighted adjacency matrix, along with the Encoder Embeddings for that time step, serves as input to the first DyGAT layer. Similarly, the Edge Embeddings at the same time step are extracted as a tensor of shape [batch_size, n_nodes, n_nodes, feature_size] and are also used as input to the first DyGAT layer. The following sub-sections provide a detailed description of the DyGAT components.

3.2.1.1 Attention-based Dynamic Adjacency Matrix

The purpose of using attention to compute the adjacency matrix is to dynamically adjust the graph structure based on the input features at each time-step. This dynamic adjustment allows the model to capture the changing relationships between nodes in the graph over time. The dynamic adjacency matrix applies linear transformations to the node features to obtain query and key representations, then computes attention scores using a dot-product attention mechanism, and finally applies softmax to obtain a normalized adjacency matrix.
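As a rough illustration of the computation just described, the following pure-Python sketch builds the adjacency matrix for a single sample at a single time-step. It is not the thesis implementation: the weight matrices are fixed lists here (in the model they are learnable), the batch dimension is omitted, the helper names are invented for this example, and the final product of the attention matrix with its transpose follows the construction the thesis states in Eq. (3.15).

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def dynamic_adjacency(x, w_q, w_k):
    """x: [n_nodes, feature_size] node features at one time-step.

    Returns an [n_nodes, n_nodes] weighted adjacency matrix.
    """
    q = matmul(x, w_q)                      # queries
    k = matmul(x, w_k)                      # keys
    d_k = len(k[0])                         # key dimension
    scores = matmul(q, transpose(k))        # dot-product scores
    scaled = [[v / math.sqrt(d_k) for v in row] for row in scores]
    a = softmax_rows(scaled)                # normalized attention weights
    return matmul(a, transpose(a))          # Adj = A A^T


# Illustrative inputs: 3 nodes, 2 features, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
adj = dynamic_adjacency(x, identity, identity)
```

Because the result is a product of a matrix with its own transpose, it is symmetric by construction; the directional information in DyGAT comes from the edge features rather than from this matrix.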
The DyGAT uses one time-step at a time to compute the dynamic adjacency matrix. The input node features Xn at time-step t have shape [batch_size, n_nodes, nodes_feature_size]. First, the Query (Q) and Key (K) are computed as follows:

$$Q = X W_Q \tag{3.12}$$

$$K = X W_K \tag{3.13}$$

where X is the input feature tensor (here, Xn at time-step t) and W_Q and W_K are learnable weight matrices. Then we compute the attention weights (A) using the softmax function:

$$A = \operatorname{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) \tag{3.14}$$

where d_k is the dimension of the key vectors. Finally, we construct the adjacency matrix:

$$\mathrm{Adj} = A A^{\top} \tag{3.15}$$

3.2.1.2 Dynamic GAT Layer

This is the core component of the DyGAT module, where two sets of attention mechanisms are used to capture the spatial dependencies between nodes at each time-step. The first attention mechanism, ‘node attention’, is a graph self-attention mechanism similar to the one used in the original GAT and in STGAT [73]. The difference is that STGAT uses an adjacency matrix that is initialized as a learnable parameter, while we use an attention mechanism to create a truly dynamic adjacency matrix that changes with each time-step.

After the dynamic adjacency matrix is created, node attention uses it as the basis for computing attention scores between nodes. The adjacency matrix influences which nodes are considered neighbors and how much weight each neighbor gets when aggregating features. To compute the node attention weights, Q and K are first computed using equations (3.12) and (3.13), then the attention scores are computed using the leaky ReLU function:

$$e_{ij} = \operatorname{LeakyReLU}(Q_i \cdot K_j) \tag{3.16}$$

where e_ij represents the attention score between node i and node j, and leaky ReLU is defined as follows:

$$\operatorname{LeakyReLU}(x) = \begin{cases} 0.01x & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases} \tag{3.17}$$

Then, the attention scores are used to compute the weighted sum of the neighboring node representations:

$$\text{nodes\_attention}_i = \sum_{j} \operatorname{softmax}_j(e_{ij}) \, X_j \tag{3.18}$$

This process is repeated across different attention heads, and the outputs of the multi-head attention are then either concatenated or averaged; this choice is exposed as a hyper-parameter.

The second attention mechanism, ‘edge attention’, is applied to the edge features we created during the feature engineering phase. We reviewed different attention mechanisms and found additive attention [74] to be the most suitable because of the nature of the directional edge features. These directional features are specific to each pair of nodes and require a more focused attention mechanism that can address these pairwise dependencies.

The edge features input vector Xe at time-step t is of shape [batch_size, n_nodes, n_nodes, edge_feature_size]. First, Q and K are computed using equations (3.12) and (3.13), and then Q and K are combined using an additive operation and passed through a hyperbolic tangent activation function (tanh):

$$Z = \tanh(Q + K) \tag{3.19}$$

After that, Z is passed through another linear transformation to obtain the attention scores. This transformation is performed by the ‘Linear’ layer in PyTorch, which is defined as:

$$Y = xA^{\top} + b \tag{3.20}$$

where x is the input tensor, A is a learnable weight matrix, and b is the bias. Passing Z through the linear layer gives the attention scores, which are then passed through a softmax function to obtain the edge attention weights. The edge attention weights are then used to modulate the node features, giving another perspective on the spatial dependencies, this time based on the edge features. The output is then added to the multi-head attention outputs to be concatenated or averaged.

By combining node self-attention, which models how each node influences all other nodes, with edge additive attention, which captures the weight of the interactions between each pair of nodes based on directional features, DyGAT can create a richer spatiotemporal representation.
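The two attention computations in this sub-section can be sketched in plain Python for one sample at one time-step. This is a simplified, single-head illustration, not the thesis code. It makes several assumptions that the text does not specify: weights are fixed lists rather than learned parameters, the node attention omits the adjacency-matrix weighting, the edge attention softmax is taken over target nodes for each source node, and all function names are invented here.

```python
import math


def leaky_relu(x, slope=0.01):
    """Leaky ReLU with the 0.01 negative slope given in Eq. (3.17)."""
    return x if x >= 0 else slope * x


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vec, weight):
    """Project a vector (d_in) with a weight matrix (d_in x d_out), no bias."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


def node_attention(x, w_q, w_k):
    """Single-head node attention: weighted sum of node features per node."""
    q = [linear(row, w_q) for row in x]
    k = [linear(row, w_k) for row in x]
    n, f = len(x), len(x[0])
    out = []
    for i in range(n):
        # e_ij = LeakyReLU(Q_i . K_j)
        scores = [leaky_relu(sum(a * b for a, b in zip(q[i], k[j]))) for j in range(n)]
        alpha = softmax(scores)
        out.append([sum(alpha[j] * x[j][c] for j in range(n)) for c in range(f)])
    return out


def edge_attention_weights(edges, w_q, w_k, a, b=0.0):
    """Additive attention over edge features.

    edges[i][j] is the feature vector of the edge from source i to target j.
    a and b play the role of the final linear layer that maps Z to a score.
    """
    weights = []
    for row in edges:
        scores = []
        for e_ij in row:
            q = linear(e_ij, w_q)
            k = linear(e_ij, w_k)
            z = [math.tanh(qv + kv) for qv, kv in zip(q, k)]   # Z = tanh(Q + K)
            scores.append(sum(zv * av for zv, av in zip(z, a)) + b)
        weights.append(softmax(scores))  # assumed: normalize over targets j
    return weights


# Illustrative inputs only.
identity = [[1.0, 0.0], [0.0, 1.0]]
nodes = [[1.0, 0.0], [0.0, 1.0]]
node_out = node_attention(nodes, identity, identity)
edges = [[[0.5, 0.1], [0.5, 0.1]], [[0.2, 0.9], [0.8, 0.3]]]
edge_w = edge_attention_weights(edges, identity, identity, a=[1.0, 1.0])
```

In a multi-head setting, `node_attention` would be run once per head with separate projections, and the head outputs would then be concatenated or averaged as described above.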
3.2.1.3 Final DyGAT Output

Each Dynamic GAT Layer returns a modified representation of the input that includes the spatial dependencies for multiple time-steps. The final vector is a spatiotemporal representation of the entire input sequence, which is either the final output of DyGAT or, if multiple layers are used, the input to the next Dynamic GAT Layer.

3.2.2 Informer

The Informer [46] is a time-series Transformer variant introduced by Zhou et al. to address some of the issues that arise when Transformers are used for long-range time-series forecasting. The Informer introduced three major modifications to the vanilla Transformer. The first is the ProbSparse self-attention mechanism, which reduces the Transformer’s quadratic time complexity to O(L log L) by selectively attending to the most relevant parts of the data. The second is a self-attention distilling technique for efficiently handling extremely long sequences. Finally, the model employs a generative-style decoder that predicts long time-series sequences in a single forward operation, which significantly speeds up inference for long-sequence predictions.

The Informer model is composed of an encoder and a decoder, both of which use a combination of multi-head self-attention and ProbSparse self-attention layers. The encoder processes the input sequence, while the decoder generates the predicted output sequence. A more detailed description of the Informer architecture follows:

1. Encoder: It processes a sequence of input features to capture the temporal dependencies effectively. It employs multiple layers of multi-head ProbSparse self-attention. It uses self-attention distilling to address redundancy in the encoder's feature maps by prioritizing features with dominant information. Distilling reduces the time complexity significantly through a max-pooling operation and convolutional layers. The number of self-attention distilling layers decreases progressively in each layer, forming a pyramid structure.
2. Decoder: It generates long sequential outputs efficiently. It uses the standard Transformer decoder with layers of multi-head self-attention. It uses “Masked Multi-head Self-Attention” to prevent the decoder from attending to future tokens during training, which maintains the autoregressive property. The decoder incorporates generative inference, where it generates the entire output sequence in a single forward pass to accelerate the decoding process.

The output of our DyGAT model is concatenated with the original input embeddings to provide stability during training and to give the Informer a version of the original temporal data as it was before the dynamic spatial dependency representations were added. The user can choose either to use the DyGAT output directly as input to the Informer or to concatenate it with the input embeddings. We made this concatenation an option in the hyper-parameters; however, as mentioned before, during hyper-parameter tuning we found that concatenation produced better performance, so it is the default option. The overall architecture of the Informer is presented in Figure (A.1) in Appendix A.

Chapter Four: Experimental Design and Setup

This chapter outlines the experimental design and setup employed to evaluate the performance of the proposed hybrid spatiotemporal PM2.5 forecasting model. We begin by introducing the baseline models against which our model’s performance is evaluated. We provide an overview of each model's architecture and the rationale for choosing it.
Following this, the experimental setup is described, including the hardware and software specifications, the model training and testing parameters, the hyper-parameter tuning process, and the evaluation metrics that were used.

4.1 Baseline Models

The first phase of the study aimed to test the first hypothesis, which stated that a hybrid model addressing both spatial and temporal dependencies in air quality data would produce more accurate PM2.5 forecasts at individual stations than a time-series forecasting model alone. To evaluate this, we employed our primary model, DyGAT-Informer. We used the Beijing dataset to train both our DyGAT-Informer model and an Informer model without DyGAT, to verify whether capturing spatial dependencies between the measuring stations would improve the forecasting results. We also combined our DyGAT model with other time-series forecasting models, namely a Seq2Seq LSTM model and the Autoformer, to evaluate the performance of the temporal component of the hybrid model. In addition, we compared our model with STGAT [73], because we used their version of GAT as the foundation of our model. We believed that this comparison would provide insight into the effectiveness of the modifications in our DyGAT variant. Finally, we chose a PyTorch implementation [75] of a model [76] that uses GCN as its spatial dependency computation module. The following is a detailed description of the architectures of the four baseline models:

1- DyGAT-Autoformer

This hybrid model combines our DyGAT with the Autoformer [67], a time-series forecasting Transformer that uses a signal decomposition technique to break down the time-series data into components representing the trend, seasonal, and residual parts of the time-series. Signal decomposition is intended to help the model understand the underlying structure of the data. We selected this Transformer because of the high performance it has demonstrated in the literature.
The model uses an Auto-Correlation mechanism instead of the self-attention mechanism used in Transformers. It focuses on identifying periodic patterns within the data and their relationships, which enables the model to capture long-range dependencies.

2- DyGAT-LSTM

We combined our DyGAT model with an LSTM-based encoder-decoder model, a type of model typically used for sequence-to-sequence tasks such as time-series forecasting. The Seq2Seq LSTM model was part of a research paper on a spatiotemporal wind speed forecasting model by Bentsen et al. [77]. Since LSTMs are popular and powerful time-series forecasting models, we found it interesting to combine an LSTM with DyGAT in a hybrid model and compare its performance to that of the DyGAT-Informer. In this LSTM model, the encoder processes the input sequence using a multilayer LSTM network, which transforms the input data into a series of internal representations, known as “hidden states”, that capture the sequential dependencies and patterns in the input. These internal representations are then passed to the decoder. The decoder also uses a multilayer LSTM network, which generates the output sequence step by step by leveraging the information from the hidden states provided by the encoder. The model uses a “recursive strategy”, where each prediction is used as the input for the next time-step, allowing the model to iteratively refine its forecasts based on previous outputs.

3- ST-GAT

The ST-GAT model architecture combines a GAT with temporal convolution to handle spatiotemporal data. The model has a “TimeBlock” that captures temporal dependencies for each node separately. The temporal convolution is performed by a Gated Temporal Convolutional Layer (GTCN), which helps the model learn temporal patterns by applying a series of convolutional operations over time. The TimeBlock has several GTCN layers. After the temporal processing, the model uses GAT layers to capture spatial dependencies.
The GAT layers use a node attention mechanism to dynamically weigh the importance of neighboring nodes. After the temporal and spatial processing of the data, a final output layer transforms the final feature representations into the desired forecast window. The overall ST-GAT architecture is presented in Figure (A.2 - a) in Appendix A, and Figure (A.2 - b) illustrates the structure of the GTCN.

4- ST-GCN

The Spatiotemporal Graph Convolutional Network (ST-GCN) has a similar design to ST-GAT: the model has spatiotemporal blocks, and it uses temporal convolution to process the temporal dependencies. However, this