An-Najah National University 

Faculty of Graduate Studies 

 
A HYBRID DEEP LEARNING MODEL FOR 

FORECASTING PM2.5 AIR POLLUTANT 

CONCENTRATIONS 

 
By 

Asma Mousa (Mohammad Ali) Massad 

 
Supervisors 

Dr. Anas Toma 

Dr. Abdelhaleem Khader 

 
This Thesis is Submitted in Partial Fulfillment of the Requirements for the Degree 

of Master in Artificial Intelligence, Faculty of Graduate Studies, An-Najah 

National University, Nablus - Palestine. 

2024 



 List of Contents  

Declaration
List of Contents
List of Tables
List of Figures
List of Appendices
Abstract
Chapter One: Introduction
1.1 Theoretical Basis
1.2 Related work
1.3 Problem Statement and Study Objectives
1.4 Study Hypothesis
1.5 Importance of the Study
Chapter Two: Foundational Concepts
2.1 Attention Mechanisms
2.2 Graph Attention Networks
2.3 Time-series Transformers
Chapter Three: Methodology
3.1 Data Description and Preprocessing
3.1.1 Data Analysis
3.1.2 Data Preprocessing
3.1.3 Features Engineering
3.2 Model Architecture
3.2.1 DyGAT
3.2.2 Informer
Chapter Four: Experimental Design and Setup
4.1 Baseline Models
4.2 Experimental Setup
4.2.1 Model Training and Optimization
4.2.2 Model Evaluation and Validation
Chapter Five: Results and Discussion
5.1 Beijing Dataset
5.1.1 DyGAT Performance Analysis
5.1.2 Spatiotemporal vs Temporal Forecasting
5.1.3 Comparative Analysis with Baseline Models (Beijing Dataset)
5.2 Case Study - Nablus Dataset
5.2.1 Impact of Contextual Features on Forecasting Accuracy
5.2.2 Comparative Analysis with Baseline Models (Nablus Dataset)
Chapter Six: Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
List of Abbreviations
References
Appendices
Abstract (in Arabic)

 

List of Tables 

Table 1: Datasets Properties
Table 2: Statistical Summary for Numerical Features in Beijing Dataset
Table 3: Statistical Summary for Numerical Features in Nablus Dataset
Table 4: Models Training Parameters
Table 5: Models Hyper-parameters after Tuning
Table 6: Evaluation Results of DyGAT-Informer vs. Informer for Beijing Stations (24-Hour Forecasting)
Table 7: Evaluation Results for Baseline Models over Different Forecasting Windows (Beijing Dataset).
Table 8: Evaluation Results of DyGAT-LSTM and LSTM Models with and without GIS Data (Nablus Dataset)
Table 9: Statistical Significance of Performance Differences between Models (Nablus Dataset)
Table 10: Evaluation results of Baseline Models (Nablus dataset).

 

List of Figures     

Figure 1: Average PM2.5 Concentration by Month for Each Year for Beijing dataset
Figure 2: The Boxplots of the PM2.5 Values at Each Station in the Beijing Dataset.
Figure 3: The Correlation Matrix of the Numerical Features in the Beijing Dataset.
Figure 4: Wind Direction Degree Mapping Chart
Figure 5: Overall Architecture of the DyGAT-Informer Model
Figure 6: DyGAT Architecture. (a) Overall Architecture of the DyGAT. (b) Detailed Architecture of a Dynamic GAT Layer.
Figure 7: Actual vs Predicted Values by DyGAT-Informer Model (Beijing Dataset)
Figure 8: DyGAT Attention Weights Visualization. (a) Nodes Attention Weights at 3 Different Time-steps. (b) Edge Attention Weights at 3 Different Time-steps. (Beijing Dataset)
Figure 9: Evaluation Scores for the Models over Different Forecasting Windows for Beijing Dataset. (a) MAE, (b) RMSE, (c) SMAPE
Figure 10: Actual vs Predicted Values by DyGAT-LSTM(GIS) Model (Nablus Dataset)
Figure 11: Evaluation Scores for the Models over Different Forecasting Windows for Nablus Dataset. (a) MAE, (b) RMSE, (c) SMAPE

 

List of Appendices 

Appendix A
Figure A.1: The Architecture of the Informer
Figure A.2: ST-GAT Architecture. (a) The Overall Architecture of the ST-GAT.
Figure A.3: The Architecture of the ST-GCN
Figure A.4: Attention Weights Visualization from a Different Batch of Date. (a) Nodes Attention Weights. (b) Edges Attention Weights.
Figure A.5: Actual vs Predicted Values for Baseline Models at One Station (Beijing Dataset)
Figure A.6: Actual vs Predicted Values for Three Baseline Models (Nablus Dataset).
Figure A.7: Actual vs Predicted Values for DyGAT-Informer Using “Learned” Temporal Embedding (Nablus Dataset).
Figure A.8: Actual vs Predicted Values for DyGAT-LSTM and DyGAT-Autoformer (Nablus Dataset).

 


 
Abstract 

Air quality forecasting is a crucial research field that aids scientists and policymakers in 

making informed decisions to combat air pollution. Among various pollutants, PM2.5,

particulate matter with a diameter smaller than 2.5 micrometers, poses significant health

risks, as it can reach the lower respiratory tract and enter the bloodstream. Accurately 

forecasting PM2.5 levels is thus essential. Although machine learning-based 

spatiotemporal forecasting models have advanced, the pursuit of more accurate forecasts

continues. The use of hybrid deep learning models for PM2.5 forecasting represents a 

promising and active area of research, as these models aim to capture complex 

spatiotemporal dependencies more effectively. 

We developed a Dynamic Graph Attention Network (DyGAT) to model spatial 

dependencies effectively. DyGAT leverages engineered edge features, including 

distance, wind speed, and wind direction, while using attention mechanisms to capture 

the dynamic nature of these dependencies. DyGAT was then combined with Informer, a 

Transformer for efficient time-series forecasting, to capture spatial and temporal patterns 

comprehensively, improving prediction accuracy. Our model was evaluated on a 

benchmark dataset from Beijing, with 420,768 records over four years. DyGAT-Informer 

outperformed a version without the DyGAT component and other baseline models. It 

achieved 50.43 for MAE, 79.9 for RMSE and 28.88% for SMAPE, compared to 51.44 

for MAE, 80.83 for RMSE and 30.25% for SMAPE in the next best model. 

Additionally, we conducted a case study using a dataset from Nablus, Palestine, 

consisting of 2692 records per station over a two-month period. We incorporated

geospatial features about nearby pollution sources into the dataset. Due to the insufficient 

number of records in the Nablus dataset for training the Informer, it was replaced with a 

sequence-to-sequence Long Short-Term Memory (LSTM) model. DyGAT-LSTM, 



trained with additional geospatial features about nearby pollution sources, achieved a

2.08% reduction in MAE, 1.17% in RMSE, and 1.96% in SMAPE relative to training without these features. This confirms the

benefit of incorporating such data. Finally, despite the short distances between stations, 

DyGAT successfully captured spatial dependencies, where DyGAT-LSTM achieved a 

reduction of 3.13% in MAE, 1.48% in RMSE, and 3.67% in SMAPE when compared to 

the LSTM-only model. 

Keywords: Spatiotemporal Forecasting, Air Quality, PM2.5 Forecasting, Hybrid Deep

Learning Model, Graph Attention Networks, Transformers.



Chapter One 

Introduction 

Air pollution is one of the existential threats to humans in our modern societies. 

According to the World Health Organization (WHO) in 2019, 99% of the world's 

population live in places where air pollution levels are above the WHO air quality 

guidelines [1]. There are many types of air pollutants; the most common are

Particulate Matter (PM), carbon monoxide (CO), ozone (O3), sulphur dioxide (SO2),

and nitrogen dioxide (NO2). Each of these pollutants has its own effects on

population health and on the environment [2]. In general, air pollutants can

affect different systems and organs in the human body; they can cause cancers 

(particularly lung cancer) and they are correlated with decreased cancer survival rates [3]. 

Additionally, exposure to air pollutants is linked to chronic respiratory and cardiovascular 

diseases [4]. Air pollution is linked to one in nine global deaths and seven million annual

premature deaths [5].  

Particulate Matter (PM) is one of the prominent air pollutants that directly affect the 

population’s health [6]. PM is a generic term to describe a variety of pollutants

suspended in the air that may differ in their chemical and physical properties. PM particles 

are usually classified according to their aerodynamic diameter; the two common classes 

are PM10 and PM2.5. PM10 refers to particles that have an aerodynamic diameter less than 

10 µm, while PM2.5 refers to particles with aerodynamic diameter less than 2.5 µm [2]. 

While all classes of PM pollutants can pose health risks to the human body, finer particles 

like PM2.5 are more hazardous [4]. Exposure to fine particles like PM2.5 worsens illnesses 

like asthma, cancer, and diabetes, while also harming mental health and cognitive 

development [5]. 

Common sources of air pollutants include industrial emissions, fuel combustion, and

photochemical reactions from plants. Agricultural and domestic use of herbicides and

insecticides is another source of air pollutants [2]. To deal with the hazards of air

pollutants like PM2.5, governments and policy makers need models that can predict the 

concentration levels of these pollutants in order to plan future environmental policies and 

issue alerts to the population when PM2.5 levels are high.  



1.1 Theoretical Basis 

Forecasting PM2.5 concentrations is crucial due to the significant health risks associated 

with fine particulate matter exposure. Accurate forecasts enable timely public health 

warnings, allowing individuals to take preventive measures during high pollution periods. 

Moreover, reliable forecasts assist policymakers in implementing effective air pollution 

control strategies, thereby improving overall air quality [7]. Additionally, air pollution 

control strategies are typically long-term and costly. Therefore, air pollutant

forecasting models, specifically for PM2.5, can help assess the effectiveness of these measures

by modeling trends and investigating the impact of applied interventions and air pollution 

control strategies [8]. 

PM2.5 forecasting models can be divided into traditional approaches and Machine

Learning (ML) based approaches. Among traditional approaches, the most commonly

used are Chemical Transport Models (CTMs) and statistical models. However, these

approaches face several difficulties: the complexity of atmospheric processes,

oversimplification of the underlying processes that produce PM2.5, a lack of adaptability

(since these models are usually built for a specific region), and an inability to

capture the nonlinear behavior and interactions between air components [9].

Due to recent advances in relatively cheap and reliable measurement instruments, a

large amount of historical air quality data has become available. Additionally,

computational power has increased rapidly in the last few years, which has encouraged

researchers to explore a variety of machine learning based architectures that use

historical data to overcome some of the limitations of traditional methods, such as

difficulty modeling nonlinear relations and oversimplification of the atmospheric

processes that affect air quality [9] [10].

1.2 Related work 

Machine learning models are becoming widely used tools for researchers to predict the 

air quality index and the concentrations of specific pollutants, such as PM2.5. A large 

variety of traditional machine learning algorithms and deep learning models have been built to

forecast air quality and pollutant concentrations; each model has its own features,

strengths and weaknesses. The choice of a forecasting model depends on the available data

and purpose of the research [7] [9].  



Machine learning approaches are divided into traditional ML and Deep Learning (DL) 

models. A recent survey [7] of machine learning models for air quality

forecasting over the last ten years found that both traditional ML and DL models are used;

however, deep learning based models are becoming more

prominent in recent studies. Deep learning models are capable of modeling

high-dimensional data and nonlinear relations [9]. The DL models used in the literature for

PM2.5 forecasting were either variations of a single DL model or hybrid models that

combine more than one model to deal with different

aspects of the data [7].

For traditional ML models, Support Vector Machine (SVM) [11] and Random Forests

(RF) [12] were the two most used algorithms. When it comes to DL, the most common

models are Multi-Layer Perceptron (MLP) and Long Short-Term Memory (LSTM) neural

networks, which are a variant of the Recurrent Neural Network (RNN), followed by the

Convolutional Neural Network (CNN) [7].

Yao et al. [13] used an Artificial Neural Network (ANN) to forecast daily PM2.5 levels. They

used ground monitoring data and incorporated other sources like satellite images, and

their predictions were more accurate than those of traditional multiple regression

models. Agarwal et al. [14] developed an ANN based model to forecast many

air pollutants including PM2.5 and PM10; they provided daily forecasts and longer

forecasts that could extend to four days. Their data had meteorological features and hourly

pollution levels, and their model included a dynamic real-time correction feature that

adjusts current predictions according to the model’s performance in previous

days.

The previous two studies [13][14] primarily focused on improving the prediction 

accuracy of traditional MLP networks without introducing significant changes to the 

architecture of the networks themselves. In contrast, other studies introduced more novel 

modifications to the MLP architecture to address different aspects in the forecasting 

process, or achieve specific goals. One study [15] presented an improvement to the 

performance of MLP by incorporating a Kalman filter into the learning algorithm to adapt to

the time variation of the air quality system. Another work [16] added a rolling mechanism 

and a gray model, which was used to preprocess the meteorological data in order to reduce 



complexity. Other modifications included a hybrid model combining MLP and linear 

regression [17] for improving the accuracy of forecasting PM2.5 concentrations. In their 

model, the regression component enhances the reliability of the predictions by correcting bias

in the neural network's output. 

LSTM is another widely used method for PM2.5 forecasting due to its ability to handle 

time series data effectively. Zhou et al. [18] experimented with different variations of 

LSTM including a shallow and deep LSTM, while another work [19] built a hybrid 

LSTM-Kalman model, which performed better than the classical LSTM. One study [20]

built a model based on LSTM, which they called Multi-output and Multi-index of 

Supervised Learning (MMSL); it was a spatiotemporal model where they tried to build a 

prediction model for one location by incorporating the data from other neighboring 

measuring stations.  

One variation of LSTM is Bidirectional LSTM (Bi-LSTM), which keeps information 

from the past and the future. Madaan et al. [21] built a Bi-LSTM based model with 

an adaptive attention mechanism. Other works have used transfer learning with Bi-LSTM.

For example, one study [22] used transfer learning with Bi-LSTM to transfer the 

knowledge learned within small temporal resolutions into a model that works with larger 

temporal resolutions. Another study [23] used transfer learning with Bi-LSTM to predict 

PM2.5 at new stations that do not have data. Bi-LSTM was also used in spatiotemporal 

models [24]. 

Zhang et al. [25] created a Bi-LSTM with Empirical Mode Decomposition (EMD) model

to predict PM2.5 values. In their model, they only used historical PM2.5 data without any

other meteorological data, where historical PM2.5 readings were considered as an input

signal and the EMD was used as a semi-supervised learning algorithm that extracts the

hidden frequency features. Their model was developed to enhance short-term

predictions, especially when sudden changes are present.

Other studies used sequence-to-sequence (Seq2Seq) architectures, where both the input

and output are sequences; Encoder-Decoder models are an example of Seq2Seq models.

One study [26] built an Encoder-Decoder model with LSTM, which showed

significant results for long-term forecasting of PM2.5. Another study [27] utilized the



Encoder-Decoder for effective PM2.5 prediction along with a Genetic Algorithm for feature

selection and outlier removal to enhance the forecasting accuracy. 

Researchers also experimented with hybrid deep learning models to overcome some of

the limitations of single-algorithm models. These hybrid models have become more popular

recently due to the rapid increase in computational power. One popular hybrid model is a

combination of LSTM and CNN. In some studies [28] [29] that developed hybrid

CNN-LSTM models, the CNN was used to extract features related to air quality while the

LSTM was used to model the historical process of the time-series data. Le et al. [30], in contrast,

developed a CNN-LSTM model in which the combination was used to handle the

spatial and temporal features of their data, which included traffic volume data along with

PM2.5 and meteorological data.

Zhang et al. [31] built a hybrid model that combines Variational Mode Decomposition 

(VMD) with Bi-LSTM. The VMD was used to decompose the original time series data 

signal into multiple sub-signals in the frequency domain. Their model performed better

than models that used EMD for signal decomposition.

Another way to utilize hybrid models is to build powerful spatiotemporal models by using 

Graph Neural Networks (GNN) to model the spatial relations, along with another deep 

learning model to deal with the temporal dependency in the data. Using LSTMs with 

GNNs is a popular combination to create hybrid spatiotemporal forecasting models. 

Some studies used a variant of GNN called Graph Convolutional Network (GCN) with

LSTM. Qi et al. [32] employed a GCN to capture spatial dependencies in the data and

an LSTM network to model temporal dependencies; the final forecasts are generated by

passing the output through a Fully Connected Network (FCN). To construct the weighted

adjacency matrix, they used a formula that calculates the spatial distance between

stations. In their approach, nodes are considered connected only if the distance is within

200 km; otherwise, the adjacency matrix entry is set to zero. Teng et al. [33] used a

GCN-LSTM architecture similar to the previous study; however, they incorporated Aerosol

Optical Depth (AOD) data to improve the accuracy of PM2.5 forecasts. AOD is a measure

of how much aerosol is present in a column of the atmosphere. Another study [34] created

a hybrid model using GCN with self-loops and an LSTM with a temporal sliding window

to forecast multiple air pollutants. The temporal sliding window moves over the

time-series data to generate overlapping sequences, which are then used to train the LSTM

network.
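The two data-preparation ideas mentioned above, a distance-thresholded weighted adjacency matrix and a temporal sliding window, can be illustrated with a minimal NumPy sketch. This is not the code of the cited studies: the function names are hypothetical, the haversine formula is used as a generic great-circle distance, and the inverse-distance weighting is an assumption standing in for the distance formula of [32], which is not reproduced here. Only the 200 km cut-off and the overlapping-window idea come from the text.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def thresholded_adjacency(coords, max_km=200.0):
    """Build an (N, N) weighted adjacency matrix from (lat, lon) rows.

    Stations farther apart than max_km get weight 0. The inverse-distance
    weight for connected pairs is an ASSUMPTION made for illustration.
    """
    lat = coords[:, 0][:, None]
    lon = coords[:, 1][:, None]
    dist = haversine_km(lat, lon, lat.T, lon.T)  # pairwise distances via broadcasting
    weights = np.zeros_like(dist)
    connected = (dist > 0) & (dist <= max_km)    # excludes self-loops
    weights[connected] = 1.0 / dist[connected]
    return weights


def sliding_windows(series, input_len, horizon):
    """Cut a (T, F) series into overlapping (input, target) pairs.

    Returns X with shape (n, input_len, F) and y with shape (n, horizon, F),
    where consecutive windows are shifted by one time step.
    """
    n = len(series) - input_len - horizon + 1
    if n <= 0:
        raise ValueError("series is too short for the requested window sizes")
    X = np.stack([series[i:i + input_len] for i in range(n)])
    y = np.stack([series[i + input_len:i + input_len + horizon] for i in range(n)])
    return X, y


if __name__ == "__main__":
    # Hypothetical station coordinates (lat, lon) and a toy hourly series.
    stations = np.array([[39.9, 116.4], [39.1, 117.2], [31.2, 121.5]])
    print(thresholded_adjacency(stations))
    X, y = sliding_windows(np.arange(10.0).reshape(10, 1), input_len=4, horizon=2)
    print(X.shape, y.shape)
```

A stride of one step maximizes the number of training sequences at the cost of heavy overlap between them; a larger stride would be a straightforward variation.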

The previous studies [32], [33], [34] used a sequential architecture, where the output of

the GCN is used as input to the LSTM. In contrast, a study by Gao et al. [35] used a parallel

integration method between GCNs and Bi-LSTM, where the outputs of the two models

are concatenated and passed to an FCN to create forecasts. In addition, one study [36] used

GCN with a Gated Recurrent Unit (GRU), which is another RNN variant. Some studies

[37] used another GNN variant called Graph Attention Network (GAT) [38] with LSTM,

and another one combined GAT with GRU [39].

Zhou et al. [40] introduced another GNN based hybrid model, where they used GCN for

spatial dependencies and temporal convolution to obtain temporal features. Their model

used wind-field diffusion distance to describe the relation between each pair of nodes

instead of the typical Euclidean distance. Other deep learning algorithms have also been

used to build PM2.5 forecasting models, such as the Deep Belief Network (DBN) [41] and

Autoencoder Neural Network [42].

 After the introduction of Transformers [43], many researchers started experimenting with 

variations of Transformer architectures and hybrid models that include them. Liang et al. 

[44] built a Transformer based PM2.5 prediction model and according to their results, the 

model was able to predict air quality with fine spatial granularity that was not achieved 

before. 

Al-qaness et al. [45] built ‘ResInformer’, a model based on the Informer [46], which is a

Transformer variant introduced to improve the inference speed of long-sequence

predictions. Their work improved the attention distillation block in the Informer. Ma et al.

[47] used the Informer in a spatiotemporal PM2.5 forecasting model; the key innovation

is adding a spatiotemporal embedding layer to the Informer to model the spatial

dependencies in the data. MSAFormer [48] is another Transformer based PM2.5

forecasting model; it uses a Transformer with sparse autoencoding to extract the

most important features from the vast amount of multi-site meteorological data, focusing

on the information most relevant to PM2.5 prediction. Zhang et al. [49] built an

Encoder-Decoder PM2.5 prediction model with Sparse attention-based Transformer Networks

(STN). They used the sparse attention approach to reduce the time complexity, and their results

show that the model has a low time complexity and outperformed state-of-the-art

models.

Other studies used Transformers in hybrid models. One study [50] used Informer with

GCN, aiming to improve air quality forecasting by capturing the dynamic and

intricate relationships between air pollutants and their environment. The Graph Transformer

is another hybrid concept that was introduced in the literature [51]; it has been

used for many tasks, including time-series forecasting. Li et al. [52] introduced

‘Forecaster’, a spatiotemporal forecasting model based on graph Transformers.

Graph Transformers were also used in building PM2.5 forecasting models. One study [53]

introduced Temporal Difference based Graph Transformer Networks (TDGTN), which

use temporal difference techniques to learn long-term dependencies in PM2.5

concentration data.

Although spatiotemporal forecasting models have recently been investigated and used to 

forecast PM2.5 concentrations, many existing approaches do not fully address the dynamic 

nature of spatial relationships among different locations. Traditional models and hybrid 

models have demonstrated success in capturing time-dependent patterns and spatial 

correlations. However, they often rely on static spatial representations, usually governed 

by the geolocation of the monitoring stations, which overlook the fact that spatial 

dependencies between stations can change over time due to factors like weather 

conditions that affect the dispersion process of pollutants such as PM2.5. Moreover, while 

some models do integrate meteorological data into spatial dependency modeling, they 

typically do not account for the directionality of pollutant dispersion between each pair 

of monitoring stations. 

1.3 Problem Statement and Study Objectives 

Despite recent advancements in PM2.5 forecasting models, there is still significant 

potential to improve how these models capture the complex and non-linear spatiotemporal 

patterns of PM2.5 concentrations. Moreover, the integration of geospatial data, such as 

geographical information about pollution sources, into deep learning models remains an

underexplored area.  



The primary objective of this study is to develop a hybrid spatiotemporal PM2.5 

forecasting model that addresses the dynamic spatial dependencies between locations and 

incorporates domain knowledge about pollutant dispersion. This hybrid model consists 

of two parts: a dynamic variant of GAT that we have developed to capture the dynamic 

spatial dependencies and the Informer [46], a Transformer designed for long-range time-

series forecasting, to model temporal dependencies. Unlike existing approaches that rely 

on static spatial connections, our DyGAT model introduces an attention-based dynamic 

adjacency matrix that evolves over time, reflecting changing patterns in PM2.5 

concentrations. In addition, we engineered directional edge features based on geolocation, 

wind direction and wind speed. These edge features capture important factors that

affect pollutant dispersion across the region and provide directionality to the spatial

relationships. The model will be trained and evaluated using the benchmark dataset

‘Beijing Multi-Site Air-Quality Dataset’ [54], which provides comprehensive air quality 

measurements and meteorological data from multiple locations in Beijing. 
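To give a rough intuition for how wind can make an edge directional, the sketch below computes the compass bearing from one station to another and projects the wind vector onto that bearing. It is only an illustration under stated assumptions, not necessarily the feature definition used by DyGAT: the function names are hypothetical, the wind direction is assumed to follow the meteorological convention (the direction the wind blows from, in degrees clockwise from north), and the cosine projection is one plausible way to combine wind direction and speed.

```python
import numpy as np


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2.

    Inputs are in degrees; the result is in degrees clockwise from north,
    in the range [0, 360).
    """
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dlon = np.radians(lon2 - lon1)
    x = np.sin(dlon) * np.cos(phi2)
    y = np.cos(phi1) * np.sin(phi2) - np.sin(phi1) * np.cos(phi2) * np.cos(dlon)
    return (np.degrees(np.arctan2(x, y)) + 360.0) % 360.0


def wind_alignment(wind_from_deg, wind_speed, src, dst):
    """Signed wind component along the directed edge src -> dst.

    ASSUMPTION: wind_from_deg is the direction the wind blows FROM, so the
    direction of travel is that value plus 180 degrees. A positive result
    means the wind pushes air from src toward dst; a negative result means
    it pushes air the opposite way.
    """
    wind_to = (wind_from_deg + 180.0) % 360.0
    edge_bearing = bearing_deg(src[0], src[1], dst[0], dst[1])
    return wind_speed * np.cos(np.radians(wind_to - edge_bearing))


if __name__ == "__main__":
    # Two hypothetical stations: b lies due north of a.
    a, b = (0.0, 0.0), (1.0, 0.0)
    # A southerly wind (blowing from 180 degrees) at 5 m/s.
    print(wind_alignment(180.0, 5.0, a, b))  # edge a -> b
    print(wind_alignment(180.0, 5.0, b, a))  # edge b -> a
```

Because the value changes sign when the edge is reversed, the two directions between a pair of stations receive different features, which is the kind of directionality a purely distance-based adjacency matrix cannot express.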

The second objective of this study is to investigate the effect of incorporating additional 

geospatial data about nearby pollution sources on forecasting accuracy of the hybrid 

spatiotemporal forecasting model. To address this, we will use an air quality dataset from 

the city of Nablus in Palestine, as provided by Saleh et al. [55]. This dataset includes

details on pollution sources categorized by their hazard levels. We will first evaluate the 

model using only air quality-related features and then integrate the pollution source data 

to assess its impact on forecasting accuracy at each station. To accommodate the small 

size of the Nablus dataset, the temporal forecasting component of the hybrid model, 

represented by the Informer, was replaced with a Seq2Seq LSTM model. 

The study aims to enhance spatiotemporal PM2.5 forecasting by developing a model that 

effectively captures dynamic spatial dependencies and temporal patterns. It also examines 

how incorporating additional information about nearby pollution sources for each station 

influences forecasting accuracy. 

1.4 Study Hypothesis 

We hypothesize that integrating a dynamic variant of Graph Attention Network (DyGAT) 

to capture spatial dependencies with the Informer model for capturing the temporal 

dependencies will create a hybrid spatiotemporal forecasting model that improves PM2.5 



concentration forecasting. Specifically, we expect that this hybrid model will outperform 

models that use only temporal forecasting without spatial dependency integration. 

In the case study using local data from Nablus, Palestine, we anticipate that incorporating 

additional contextual features about nearby pollution sources will enhance forecasting 

accuracy. 

1.5 Importance of the Study 

This study is important for several reasons. It addresses a public health

issue by enhancing the ability to predict PM2.5 concentrations accurately. Additionally, it

contributes to the field of environmental science and deep learning by exploring novel 

techniques for PM2.5 forecasting, which can be applied to various regions worldwide.  

Finally, it investigates how integrating geospatial and pollution sources information into 

spatiotemporal PM2.5 forecasting models affects forecasting accuracy, providing insights 

that have implications for air quality research and environmental policies. 

 

Chapter Two 

Foundational Concepts 

This chapter presents the theoretical and mathematical foundations of the components 

that form the DyGAT-Informer model. The following sections provide a comprehensive 

overview of the key components, including Attention Mechanisms, Graph Attention 

Networks, and Time-series Transformers. These components are the base of the model's 

architecture. Each section includes relevant mathematical equations and theoretical 

concepts that are crucial for understanding the model's design and functionality.  

2.1 Attention Mechanisms  

During the construction of the DyGAT model, we used different attention mechanisms 

for different tasks within the model. The final version of the DyGAT model uses three 

different attention mechanisms; therefore, we will present an introduction to attention 

mechanisms before presenting the ones we used in the Methodology chapter. 

In the context of deep learning, an attention mechanism refers to a computational 

mechanism that enables neural networks to selectively focus on certain parts of the input 

data while ignoring others, similar to the cognitive attention mechanism in the human 

brain that focuses on important elements in the environment while ignoring non-relevant 

information [56]. Attention mechanisms in neural networks help models weigh the 

importance of different input elements dynamically. Because they are general-purpose, they

are used in different types of deep learning architectures in computer vision, Natural

Language Processing (NLP), and other models that work with sequential data [57].

Although attention mechanisms have existed for a long time, their popularity started to rise after 2015, when they were used in many studies on machine translation and image captioning. At that time, attention mechanisms were typically combined with recurrent or convolutional layers in LSTMs and CNNs. In 2017, however, the Transformer [43] was introduced; it is built entirely on self-attention, which is one type of attention mechanism [58].

Attention mechanisms have different types and variations of these types. One model can 

be built by combining different types of attention techniques. Several surveys [56], [57],

[58] presented different taxonomies of attention mechanism types, which we summarize

as follows:

• Soft and Hard Attention: Soft attention assigns weights to each element in the sequence, allowing the model to focus on multiple parts simultaneously, whereas hard attention chooses a single element to attend to at each step, making a more definitive choice. Soft attention is more commonly used in the literature.

• Self-Attention: Computes attention weights within the same sequence, allowing each element to attend to other elements, including itself. It is widely used in sequence modeling tasks like machine translation and sentiment analysis.

• Multi-Head Attention: Employs multiple sets of attention weights to capture different aspects or representations of the input. This enables the model to attend to various parts of the input simultaneously, providing richer context.

• Scaled Dot-Product Attention: A form of attention in which scores are computed by taking the dot product of query and key vectors, followed by scaling and softmax normalization. It is commonly used in Transformer-based models.

• Local Attention: Focuses on a subset of nearby inputs rather than the entire input sequence. This can improve efficiency, particularly for long sequences.

• Global Attention: Considers the entire input sequence when computing attention weights, potentially attending to all elements in the sequence. It is useful for capturing long-range dependencies.

It should be noted that there are other ways to classify attention mechanisms. The 

taxonomy we presented is a summary of the surveys cited above. These attention

mechanisms are usually used in encoder-decoder architectures, like RNN and its variants 

LSTM and GRU, and in Transformers, specifically self-attention. They are also used in 

other architectures like CNNs, Memory Networks, GNNs and hybrid architectures. Each 

attention mechanism variation has its own customized mathematical representation. We 

will provide a generalized mathematical representation for attention mechanisms. 

A general attention mechanism computes a context vector c_i for each input element i based on its relevance to the current context. To compute c_i, we need a Query matrix (Q) and a Key matrix (K). They may have different names depending on the model’s architecture or the type of attention mechanism, but they are generally defined as:

• Query: Represents the element of interest for which we want to compute the attention scores. It serves as a reference point or the focus of attention.

• Key: Keys are a representation of other elements in the input sequence. Each key is compared with the query to determine how relevant it is to the query. Keys help in assessing the importance or relevance of different elements in the input sequence with respect to the query.

To compute the context vector, we first need to compute attention scores also known as 

energy (e): 

\[ e_{ij} = f(q_i, k_j) \tag{2.1} \]

where f is the score function, also called the compatibility or alignment function. The

score function determines the similarity or compatibility between a query and a key, 

ultimately influencing the attention weights assigned to each key. There are different 

types of score functions used in attention mechanisms. Examples of score functions are: 

Additive, Multiplicative (dot product), Scaled multiplicative, Concat, Location-based, 

Similarity and Cosine-similarity based score functions [56] [59].  

After calculating the attention scores, we then calculate the attention weights by applying 

softmax function to the attention scores, 

\[ \alpha_{ij} = \operatorname{softmax}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \tag{2.2} \]

Finally, we compute ci as follows: 

\[ c_i = \sum_{j=1}^{n} \alpha_{ij} \, v_j \tag{2.3} \]

Where vj denotes the value or feature representation of input element j, and n is the total 

number of input elements. 
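As a minimal sketch, Equations (2.1) to (2.3) can be traced in plain Python. The scaled dot product is used here as one possible choice of the score function f; the function names and the toy queries, keys and values are illustrative assumptions and are not taken from the DyGAT implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats (Eq. 2.2)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_score(q, k):
    """Scaled dot product, one possible score function f(q, k)."""
    return sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))

def attention(queries, keys, values, score_fn=dot_score):
    """Return (contexts, weights) for every query."""
    contexts, weights = [], []
    dim = len(values[0])
    for q in queries:
        e = [score_fn(q, k) for k in keys]        # Eq. (2.1): attention scores
        alpha = softmax(e)                        # Eq. (2.2): attention weights
        c = [sum(a * v[d] for a, v in zip(alpha, values))
             for d in range(dim)]                 # Eq. (2.3): context vector
        contexts.append(c)
        weights.append(alpha)
    return contexts, weights

# Toy example with made-up numbers: 2 queries, 3 key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
contexts, weights = attention(Q, K, V)
```

Each row of `weights` sums to one, so every context vector is a convex combination of the value vectors.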



2.2 Graph Attention Networks 

A graph is a data structure used to represent relationships between objects. It is made up 

of nodes (or vertices) and edges. Nodes are the objects or entities, while edges show the 

relationships between them. Edges can be either directed, where the relationship has a 

direction from one node to another, or undirected, where the relationship is bidirectional 

between two nodes without a specific direction. Graphs can also be weighted or 

unweighted. In a weighted graph, edges have different weights, which can add another 

layer of meaning to the connections. Unweighted graphs, on the other hand, treat all edges 

equally [60].  

An adjacency matrix is a way to represent a graph using a grid. It is a square matrix where 

each cell at position (i, j) indicates whether there is an edge between node i and node j. 

For a directed graph, the cell value shows the presence and direction of the edge, while 

for an undirected graph, it just shows the presence of an edge. In a weighted graph, the 

cell can also contain the weight of the edge. This matrix provides a compact and 

convenient way to store and work with graph data [60]. Graphs are versatile and can be 

used in algorithms like Dijkstra’s for finding the shortest paths, and in machine learning 

models like Graph Neural Networks (GNNs) to handle complex relationships in data. 
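The adjacency matrix described above can be illustrated with a short sketch. The helper name `adjacency_matrix` and the three-node toy graph are assumptions made for illustration only.

```python
def adjacency_matrix(num_nodes, edges, directed=True):
    """Build a weighted adjacency matrix from (source, target, weight) triples."""
    A = [[0.0] * num_nodes for _ in range(num_nodes)]
    for src, dst, weight in edges:
        A[src][dst] = weight
        if not directed:
            # An undirected edge is stored in both directions.
            A[dst][src] = weight
    return A

# Toy graph with three nodes and made-up edge weights.
edges = [(0, 1, 2.5), (1, 2, 1.0), (2, 0, 4.0)]
A_directed = adjacency_matrix(3, edges, directed=True)
A_undirected = adjacency_matrix(3, edges, directed=False)
```

In the directed case the cell (0, 1) holds a weight while (1, 0) stays zero; in the undirected case the matrix is symmetric.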

Graph Neural Networks  [61] are a class of deep learning models specifically designed to 

work with graph-structured data, like transportation networks, social networks, and 

biological data (e.g. genes, proteins, etc.). Unlike traditional neural networks, GNNs are 

designed to excel at capturing the relationships and dependencies between connected 

nodes in a graph, making them particularly effective for tasks where the structure of the 

data is important.  

Graph Attention Network (GAT) [38] was introduced in 2018 as a GNN variant that uses 

attention mechanisms for learning features on graphs. Some GNN variants, like vanilla GCNs, do not use an attention mechanism, so they aggregate information from a node’s neighbors with fixed, non-learned weights, in effect assuming that the relative influence of each neighbor is known in advance. In contrast, GATs use attention mechanisms to focus on key features of neighboring nodes, assigning each neighboring node a different, learned weight. GATs do not require knowing the full graph structure in advance. The first GAT, presented by Veličković et al. [38], used self-attention for node classification tasks on graph-structured data. They also used multi-head attention, concatenating the outputs of the attention heads, to stabilize the learning process.
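The attention step of a single GAT head, as formulated in [38], can be sketched as follows: each node's features are linearly transformed, a score is computed for every neighbor with a LeakyReLU over a learned vector applied to the concatenated pair, and the scores are normalized with softmax over the neighborhood. This is an illustrative plain-Python sketch of the original static GAT, not of DyGAT; the toy graph, the weights `W` and `a`, and all helper names are assumptions, and the final non-linearity and multi-head concatenation are omitted for brevity.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def gat_attention_layer(H, neighbors, W, a):
    """H: node feature vectors; neighbors[i]: indices attended by node i
    (including i itself); W: shared weight matrix; a: attention vector."""
    Z = [matvec(W, h) for h in H]  # shared linear transform of every node
    outputs, alphas = [], []
    for i, nbrs in enumerate(neighbors):
        # Unnormalized score for neighbor j: LeakyReLU(a . [z_i || z_j]).
        e = [leaky_relu(sum(p * q for p, q in zip(a, Z[i] + Z[j])))
             for j in nbrs]
        m = max(e)
        exps = [math.exp(x - m) for x in e]
        total = sum(exps)
        alpha = [x / total for x in exps]  # softmax over the neighborhood
        # Weighted sum of the transformed neighbor features.
        h_new = [sum(w * Z[j][d] for w, j in zip(alpha, nbrs))
                 for d in range(len(Z[i]))]
        outputs.append(h_new)
        alphas.append(alpha)
    return outputs, alphas

# Toy graph with three nodes; all numbers are made up.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[0, 1], [0, 1, 2], [1, 2]]
W = [[1.0, 0.5], [0.0, 1.0]]
a = [0.1, -0.2, 0.3, 0.4]
gat_out, gat_alpha = gat_attention_layer(H, neighbors, W, a)
```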

GATs in general refer to a class of GNNs that utilize attention mechanisms to dynamically 

weight the importance of neighboring nodes and learn complex, context-dependent 

relationships within a graph. There are various types of GATs, with different taxonomies 

that categorize them based on factors like attention mechanisms, architecture, and 

application. Several surveys have been conducted to explore different types of GATs, 

providing detailed taxonomies and classifications. For instance, a survey [62] of

the attention mechanisms used in GNNs classified the attention mechanism used in the work of

Veličković et al. [38] as “learnable attention”, where the attention weights are

learned. The survey also identified two other classifications of attention mechanisms in 

the literature. The first is “Similarity-based attention” which is also a learned attention 

but it allocates greater attention to objects sharing more similar hidden representations or 

features. The second classification is “Attention-guided walk”, which is a type that 

utilizes attention mechanisms to guide the traversal process, unlike traditional random 

walks that traverse the graph uniformly or according to predetermined rules.  

Another survey [59] classifies attention mechanisms in GNNs into a two-level taxonomy.

The upper level divides attention mechanisms into three types based on their high-level 

architectural differences: Graph Recurrent Attention Networks (GRANs), Graph 

Attention Networks (GATs), and Graph Transformers. GRANs focus on integrating 

RNNs with attention mechanisms for graph data. GATs introduce attention directly into

graph nodes, allowing nodes to weigh their neighbors' importance. Graph Transformers

leverage transformer-based architectures for graph data. The lower level of the taxonomy

categorizes attention mechanisms in GNNs based on architectural designs within the three 

main categories.  

Finally, a comprehensive review paper [63] categorized Graph Attention Networks 

(GATs) into six main types: 

1. Global Attention Networks: which focus on the overall graph structure.

2. Multi-Layer GATs: which utilize multiple layers for deeper feature extraction.

3. Graph-embedding GATs: which utilize graph-embedding techniques to learn richer and more informative node representations.

4. Spatial GATs: which incorporate spatial information for more accurate modeling.

5. Variational GATs: which incorporate variational inference to effectively model complex, heterogeneous, and multimodal data across various domains.

6. Hybrid GATs: which combine various strategies for enhanced performance.

GATs have become widely utilized across various domains due to their ability to capture 

complex relationships within graph-structured data. They are particularly effective in 

node and graph classification, where they classify individual nodes or entire graphs based 

on their features. GATs excel in link prediction, where they estimate the likelihood of 

connections between nodes. Additionally, they are utilized in recommendation systems, 

where they enhance user-item interactions and predict user preferences. GATs are also 

used in traffic forecasting, where they model road networks to forecast traffic patterns. 

Moreover, they are used for molecular graph analysis, where they predict molecular 

properties. In image analysis, GATs improve tasks like segmentation and object detection 

by capturing spatial relationships. GATs are applied in medical fields for disease 

prediction and analysis of biological data, as well as in natural language processing, 

enhancing sentiment analysis by capturing contextual dependencies in text. Finally, 

GATs are employed in anomaly detection, including fraud detection and network 

security. These varied applications demonstrate the versatility and effectiveness of GATs 

in handling graph-based tasks across multiple fields [59][63]. 

Despite the effectiveness of GATs in handling graph-based data, they face several 

challenges. These challenges include computational complexity, as their cost increases 

with larger graphs, particularly due to the attention mechanism's complexity, leading to 

scalability issues. GATs can also suffer from over-smoothing in deep architectures, where 

node features become indistinguishable across layers, and they may struggle with 

capturing long-range dependencies in large graphs. Additionally, overfitting is a concern, 

particularly when data is limited or noisy. Moreover, interpretability remains a challenge, 

as understanding the reasoning behind attention weights can be difficult. Other limitations 

include high memory consumption due to attention weight storage and vulnerability to 

noisy data, which can reduce the robustness of GATs performance [59][62][63]. 



2.3 Time-series Transformers 

The introduction of Transformers [43] has revolutionized various fields including natural 

language processing and image recognition. Their self-attention mechanism enables them to

capture long-range dependencies within sequences, and they have proven to be highly

effective in modeling complex relationships [64].

The basic architecture comprises the following components:

• Encoder: The encoder consists of multiple identical layers (blocks), each containing a multi-head self-attention mechanism and a position-wise feed-forward neural network.

• Decoder: The decoder also comprises multiple identical layers. In addition to the self-attention and feed-forward network present in the encoder blocks, each decoder block incorporates cross-attention, which is used to attend to the encoder's output.

• Self-Attention Mechanism: The self-attention mechanism allows a token in the input sequence to attend to all other tokens in the same sequence. This mechanism enables the model to capture long-range dependencies efficiently.

• Positional Encoding: Transformers do not inherently understand the sequential order of tokens, which is why positional encoding is usually used to add positional information to the input embeddings. This allows the model to differentiate between the positions of tokens in the sequence.
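As an illustration of positional encoding, the fixed sinusoidal scheme of the vanilla Transformer [43] can be sketched as below. The function name and the small sizes are assumptions chosen for the example; time-series Transformers often replace or extend this scheme.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position codes."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Even dimension i uses sine, the following odd dimension cosine,
            # both with wavelength controlled by 10000^(i / d_model).
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=4)
```

The resulting rows are added to the input embeddings, so two identical inputs at different positions receive different representations.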

The Transformer's ability to work with long-range sequences made it a desirable candidate for building time-series forecasting models. In the past few years, many time-series Transformer variants have emerged, each modifying the vanilla Transformer at a different architectural level to make it suitable for time-series forecasting. One survey [64] divided these modifications into two main categories: modifications to the existing components and new architectural innovations.

The first type of modification to the existing components introduces new positional encoding methods, such as “Learnable Positional Encoding”, where the model learns the positional encoding from the input sequence. Another positional encoding technique is “Time-stamp Encoding”, where time-related features (hour, day, year, holidays, etc.) are used as positional information.

The second type of modification to the original components targets the attention module. The original Transformer has a memory and time complexity of O(N^2), where N is the length of the sequence, which poses a computational bottleneck when dealing with long sequences. Some time-series Transformers therefore add sparsity to the attention mechanism to reduce the memory and time complexity; examples include Informer [46], LogTrans [65], Pyraformer [66], and many others.
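To give a sense of scale, the following sketch counts the attention scores computed by full self-attention and compares them with an N log N budget, which is the order of complexity reported for Informer's sparse attention [46]. The constant `c` and the sequence lengths are assumptions chosen for illustration; this is a counting exercise, not an implementation of any of the cited mechanisms.

```python
import math

def full_attention_scores(n):
    """Every query is scored against every key: N^2 entries."""
    return n * n

def log_sparse_scores(n, c=1):
    """Illustrative N log N budget (c is a hypothetical constant factor)."""
    return c * n * math.ceil(math.log(n))

for n in (96, 720):
    print(n, full_attention_scores(n), log_sparse_scores(n))
```

For a sequence of 720 steps, the full attention matrix holds 518400 scores, while the N log N budget is 5040.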

An example of architectural innovations to the Transformer is presented by the Informer, 

which has incorporated max-pooling layers between attention blocks. On the other hand, 

Pyraformer utilizes a C-ary tree-based attention mechanism.  

Other time-series forecasting Transformers, such as Autoformer [67] and FEDformer [68], use signal decomposition to enhance the model’s forecasting abilities. They also present novel attention mechanisms: Autoformer introduced an auto-correlation mechanism that analyzes the data's periodicity to identify and aggregate similar sub-series, which enables the model to capture dependencies within the data more efficiently. FEDformer builds on Autoformer but introduces its own Frequency Enhanced Attention (FEA) mechanism.
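The idea of signal decomposition can be sketched with a moving average that splits a series into a smooth trend and a seasonal remainder, which is the general approach used by decomposition-based models such as Autoformer. The window size, the edge padding and the toy series below are assumptions for illustration and do not reproduce any specific published implementation.

```python
def moving_average(series, window):
    """Centered moving average; the ends are padded by repeating edge values."""
    assert window % 2 == 1, "use an odd window so the average is centered"
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(series))]

def decompose(series, window=5):
    """Split a series into trend (moving average) and seasonal remainder."""
    trend = moving_average(series, window)
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal

# Toy series with made-up values: a slow rise plus a repeating pattern.
series = [10.0, 14.0, 11.0, 15.0, 12.0, 16.0, 13.0, 17.0]
trend, seasonal = decompose(series, window=5)
```

By construction, adding `trend` and `seasonal` element by element recovers the original series.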

 

Chapter Three 

Methodology 

In this chapter, we will present a thorough description of the datasets used in the study 

and provide insights acquired during the exploratory data analysis. In addition, we will 

explain the pre-processing techniques used to prepare the data for training and testing. 

Moreover, we will provide a detailed breakdown of the architecture of the proposed 

model. 

3.1 Data Description and Preprocessing  

In this study, the Beijing Multi-Site Air-Quality Dataset [54] will be the main dataset used 

to train and test the proposed model. The dataset contains a collection of spatiotemporal 

air quality measurements and meteorological data across 12 monitoring stations in 

Beijing. The Beijing air quality data was collected from March 1st, 2013, to February 28th, 

2017. 

The second dataset that will be used is an air quality dataset collected from 8 measuring 

stations in Nablus city in Palestine. The data was collected from January 6th 2022, to 

March 3rd 2022. It includes meteorological data for the city of Nablus, obtained from the 

'Time and Date' weather website, covering the same period as the air quality dataset. This 

dataset was collected as a part of a study by Saleh et al. [55], where they presented a 

methodology for selecting air quality monitoring locations based on low-cost sensors and 

Geographic Information Systems (GIS) [69].  

The distances between the measuring stations in the Nablus air quality dataset are

small; the longest distance between a pair of nodes is 10.93 km.

Therefore, the weather data is the same for the whole city of Nablus, and the only

differences between nodes’ features are the geolocation (longitude and latitude) and PM2.5

readings. However, we chose this data to study another factor that is usually missing in 

other air quality datasets, which is the information about nearby pollution sources.   

A general description of both datasets is presented in Table (1). One notable difference 

between the two datasets is the length of the time-step, where the Beijing dataset has a 



time-step of 1 hour while the Nablus dataset has a time-step of one minute for PM2.5 and 

30 minutes for meteorological data.  

Table 1 

Datasets Properties 

Properties | Beijing | Nablus
Number of stations | 12 | 8
Number of Features | 16 | 11
Number of Records | 420768 | PM2.5: 542607; Meteorological Data: 2531
Timestamp Interval | 1 Hour | PM2.5: 1 Minute; Meteorological Data: 30 Minutes
Time Span | March 1st, 2013, to February 28th, 2017 | January 6th 2022, to March 3rd 2022

Although the Nablus dataset contains a high number of records due to the PM2.5 data 

being recorded every minute, the final number of records per station is 2692, and the total 

number of records across all stations is 21536. This reduction is due to the meteorological 

data being recorded at a 30-minute interval. For the Beijing dataset, there are 35064 

records per station, and the total number of records is 420,768.  
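The record counts quoted above can be cross-checked with a short calculation. The sketch assumes that the Beijing series runs hourly from 00:00 on the first day to 23:00 on the last day with no missing timestamps; this is an assumption about the dataset layout, while the station counts and per-station totals are taken from the text.

```python
from datetime import datetime, timedelta

# Beijing: hourly records from March 1st, 2013 to February 28th, 2017.
start = datetime(2013, 3, 1, 0)
end = datetime(2017, 2, 28, 23)
beijing_per_station = int((end - start) / timedelta(hours=1)) + 1
beijing_total = beijing_per_station * 12   # 12 stations

# Nablus: per-station count as reported in the text, 8 stations.
nablus_total = 2692 * 8

print(beijing_per_station, beijing_total, nablus_total)
```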

3.1.1 Data Analysis 

The Beijing dataset consists of measurements from 12 stations. To perform data analysis,

the data from all 12 stations were combined. The raw dataset contained 18 features,

including 16 numerical features and two categorical features. Out of the 16 numerical 

features, four features represent temporal information (year, month, day, hour), one 

feature is the ordinal number of the timestamp. The remaining 11 numerical features are 

related to meteorological and air pollutant features. The two categorical features are 

“station name”, “wind direction”. Two additional features, longitude and latitude, were 

added for each station. Wind direction is a categorical feature with 16 categories, where 

for example, ‘W’ means west, ‘SW’ means southwest, and ‘WSW’ means west-

southwest.  
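One way to turn the 16 wind direction categories into numbers is to map each label to its compass bearing and then to a sine and cosine pair, so that neighboring directions receive similar values. This is a hedged sketch that assumes the labels follow the standard 16-point compass with bearings measured clockwise from north; it is not necessarily the encoding used in this study, and the function names are hypothetical.

```python
import math

# Standard 16-point compass, clockwise from north, 22.5 degrees apart.
COMPASS_16 = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
              "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]

def wind_to_degrees(label):
    """Bearing in degrees for a 16-point compass label."""
    return COMPASS_16.index(label) * 22.5

def wind_to_sin_cos(label):
    """Cyclic encoding that keeps 'N' and 'NNW' close to each other."""
    theta = math.radians(wind_to_degrees(label))
    return math.sin(theta), math.cos(theta)

print(wind_to_degrees("W"), wind_to_degrees("SW"), wind_to_degrees("WSW"))
```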

Table (2) presents a statistical summary of the relevant numerical features. The summary

includes count, mean, minimum, maximum, standard deviation and Coefficient of 

Variation (CV) for each feature. The statistical summary provides valuable insights into 



the distribution and range of these variables. For example, Atmospheric Pressure (PRES) 

demonstrates low variability in the data points with a CV of 1.04%. Features such as 

PM10, NO2, TEMP, and WSPM exhibit moderate variability, suggesting a noticeable but 

not extreme spread in their data values.  

Table 2 

Statistical Summary for Numerical Features in Beijing Dataset 

Feature | Description | count | mean | min | max | std | CV
PM10 | PM10 concentration in µg/m3 | 415037 | 104.57 | 2.00 | 999.00 | 91.70 | 87.67%
SO2 | Sulfur dioxide concentration in µg/m3 | 412682 | 15.82 | 0.29 | 500.00 | 21.63 | 136.74%
NO2 | Nitrogen dioxide concentration in µg/m3 | 409675 | 50.64 | 1.03 | 290.00 | 35.08 | 69.25%
CO | Carbon monoxide concentration in µg/m3 | 401843 | 1229.30 | 100.00 | 10000 | 1157.82 | 94.21%
O3 | Ozone concentration in µg/m3 | 409210 | 57.31 | 0.21 | 1071.00 | 56.55 | 98.67%
TEMP | Temperature in degrees Celsius | 420390 | 13.54 | -19.90 | 41.60 | 11.44 | 84.50%
PRES | Atmospheric Pressure in hPa | 420395 | 1010.75 | 982.40 | 1042.80 | 10.47 | 1.04%
DEWP | Dew point temperature in degrees Celsius | 420385 | 2.49 | -43.40 | 29.10 | 13.79 | 553.01%
RAIN | Rain precipitation in mm | 420398 | 0.06 | 0.00 | 72.50 | 0.82 | 1366.67%
WSPM | Wind speed in m/s | 420464 | 1.73 | 0.00 | 13.20 | 1.25 | 72.25%
PM2.5 | PM2.5 concentration in µg/m3 | 412954 | 79.74 | 2.00 | 999.00 | 80.74 | 101.25%



High variability is observed in SO2, CO, O3, and PM2.5 with high CVs, indicating a broad 

range of data points. Notably, DEWP and RAIN show extremely high variability, with 

CVs that suggest substantial fluctuations relative to their means. 
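The Coefficient of Variation is the standard deviation expressed as a percentage of the mean. The sketch below recomputes it for three features using the means and standard deviations listed in Table (2); the function name is an assumption.

```python
def coefficient_of_variation(mean, std):
    """CV in percent: standard deviation relative to the absolute mean."""
    return std / abs(mean) * 100.0

# (mean, std) pairs taken from Table (2).
summary = {"PRES": (1010.75, 10.47), "RAIN": (0.06, 0.82), "PM2.5": (79.74, 80.74)}
cv = {name: round(coefficient_of_variation(m, s), 2)
      for name, (m, s) in summary.items()}
print(cv)
```

A mean close to zero, as for RAIN, inflates the CV even when the absolute spread is small, which is why DEWP and RAIN stand out.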

To better understand the temporal variations in PM2.5 concentrations, Figure (1) presents a bar chart of the average PM2.5 concentration for each month from 2013 to 2017. Each bar represents the average PM2.5 concentration for a particular month. The average PM2.5 value for the same month varies considerably from year to year. A clear example is February, which had the highest average PM2.5 concentration of the year in 2014 but the lowest in 2016.

Figure 1 

Average PM2.5 Concentration by Month for Each Year for Beijing Dataset 

Figure (2) shows boxplots for the PM2.5 values at each station for the Beijing dataset. The 

boxplots visualize the distribution of the data. They offer insights into our data, including 

the 25th percentile (first quartile (Q1)), the median (Q2), the 75th percentile (third quartile 

(Q3)). They also highlight the minimum and maximum values, as well as the presence of 

outliers. Each horizontal boxplot represents a specific station, and the vertical axis 

displays PM2.5 values. The box for each station provides a summary of the data 

distribution. The elements of the boxplot are as follows [70]: 



• Box: Represents the Interquartile Range (IQR), which extends from the first quartile (Q1) to the third quartile (Q3) and covers the middle 50% of the data. The line within the box indicates the median (Q2) PM2.5 value.

• Whiskers: The lines extending from the box indicate the range of the data within 1.5 times the IQR beyond Q1 and Q3.

• Outliers: Individual data points plotted beyond the whiskers are considered outliers.
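The boxplot elements listed above can be computed directly. The following sketch uses Python's `statistics.quantiles` with the inclusive method and the common 1.5 × IQR rule; the sample readings are made-up values, not data from the Beijing dataset.

```python
import statistics

def boxplot_stats(values):
    """Quartiles, whisker ends and outliers under the 1.5 * IQR rule."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        # Whiskers reach the most extreme points still inside the fences.
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }

# Made-up readings with one extreme spike.
stats = boxplot_stats([10, 20, 30, 40, 50, 60, 70, 80, 500])
print(stats)
```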

Figure 2 

The Boxplots of the PM2.5 Values at Each Station in the Beijing Dataset. 

 
Most stations show similar distributions in their PM2.5 values, with many having their 

median values close to 50. There are many outliers especially at higher PM2.5 values, 

which indicates some spikes in pollution levels higher than the typical range for each 

station. Some stations have wider IQRs and more outliers, indicating greater variability 

in PM2.5 values. The Wanshouxigong station has the highest outlier value (close to 1000), 

suggesting an episode of severe pollution. 



We have decided to keep the outliers in the data for two reasons. First, in the context of environmental data like PM2.5 levels, outliers often represent real and extreme events, such as pollution spikes due to weather conditions, industrial activities, or traffic. Including these outliers allows our model to learn and adapt to such events, providing a more realistic and comprehensive understanding of air quality dynamics, which should lead to a more robust model that can provide better forecasts. Second, air quality forecasting models are often used in scenarios where predicting extreme events is crucial, which makes including outliers in the training data essential, especially when the forecasting model is used in early warning systems.

To ensure that these extreme values are not measurement errors or noise, we examined 

these extreme PM2.5 levels across all stations. We observed that these extreme values

occurred consistently across all measuring stations, making the likelihood of an 

instrument error very low. Additionally, the extreme values remain within the possible 

range for PM2.5 concentration levels. Figure (3) shows the correlation matrix for the 

Beijing dataset, which provides a visual representation of the relationships between the 

numerical features in the Beijing dataset.  

Figure 3 

The Correlation Matrix of the Numerical Features in the Beijing Dataset. 

 

The matrix highlights both positive and negative correlation coefficients (r), with darker 

colors indicating stronger positive relationships. PM2.5 shows a strong positive correlation 

with PM10 (r = 0.88), CO (r = 0.77), and SO2 (r = 0.66), which suggests that these 

pollutants’ levels are related. One possible reason for the high positive correlation 

between these pollutants is having similar sources such as vehicle emissions and 

industrial activities.   
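Each cell of a correlation matrix such as the one in Figure (3) is a Pearson correlation coefficient between two features. A minimal sketch of that computation is given below; the three short series are made-up values used only to show a strongly positive and a negative coefficient.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Made-up series: b rises with a, while c falls as a rises.
a = [35.0, 80.0, 120.0, 60.0, 200.0]
b = [50.0, 110.0, 150.0, 90.0, 260.0]
c = [3.0, 1.5, 1.0, 2.5, 0.5]
r_ab = pearson_r(a, b)
r_ac = pearson_r(a, c)
```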

We have decided to include all of the features when training the hybrid spatiotemporal 

PM2.5 forecasting model to fully leverage the comprehensive nature of the dataset. Each 

feature, such as various air pollutants (PM2.5, PM10, NO2, CO, O3, SO2), meteorological 

data (temperature, pressure, dew point, wind speed, wind direction, precipitation), and 

temporal information, offers unique insights into the complex atmospheric processes 

affecting PM2.5 levels. The DyGAT-Informer hybrid model is well-equipped to handle 

feature interactions, even those involving highly correlated variables. GAT’s attention 

mechanism allows it to focus on the most relevant relationships between features, while 

Informer’s self-attention mechanism identifies and prioritizes the temporal dependencies 

that matter most. By utilizing the full feature set, the model can capture both direct and 

indirect interactions among features, thereby enhancing its ability to forecast PM2.5 levels 

with greater accuracy. 

Regarding the Nablus dataset, as stated before, the dataset was created by combining the PM2.5 readings from 8 locations in Nablus city with weather data retrieved from the weather website. The Nablus dataset contains 11 features: 8 numerical features, 2 categorical features, and 1 temporal feature. The numerical features are latitude and longitude (spatial coordinates), temperature, humidity, atmospheric pressure, visibility distance, rain and PM2.5. The categorical features are station name and ‘Weather’. Finally, the temporal feature represents the timestamp of each measurement.

The categorical feature ‘Weather’ is a description of the weather in phrases separated by

periods (e.g. “Light rain. Partly cloudy.”). The ‘Rain’ feature is binary, with a value of zero

indicating no rain and one indicating rain. The textual description of the amount of rain 

is stated in the ‘Weather’ feature. 
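Because the ‘Weather’ feature is free text made of period-separated phrases, it has to be split before it can be encoded. The sketch below shows one possible treatment: splitting the description into phrases and turning them into a multi-hot vector over a small vocabulary. The vocabulary and function names are hypothetical, and this is not necessarily how the feature was handled in this study.

```python
def split_weather(description):
    """Split a description such as 'Light rain. Partly cloudy.' into phrases."""
    return [part.strip() for part in description.split(".") if part.strip()]

def multi_hot(phrases, vocabulary):
    """1 where a vocabulary phrase appears in the description, else 0."""
    return [1 if term in phrases else 0 for term in vocabulary]

# Hypothetical vocabulary used only for this example.
vocabulary = ["Light rain", "Partly cloudy", "Sunny"]
phrases = split_weather("Light rain. Partly cloudy.")
encoded = multi_hot(phrases, vocabulary)
```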



Table (3) presents a statistical summary of the numerical features. There appear to be some

errors in the measurements: the maximum PM2.5 value (17188.39 µg/m3) is

not plausible, and the minimum value (-1) is not physically possible.

Table 3 

Statistical Summary for Numerical Features in Nablus Dataset 

Feature | Description | count | mean | std | min | max | CV
Temp | Temperature in Fahrenheit | 21536 | 54.81 | 6.291 | 37 | 81 | 11.48%
Wind speed | Wind speed in mph | 21536 | 7.44 | 4.90 | 0 | 37 | 65.8%
Humidity | Humidity as a percentage | 21536 | 0.71 | 0.16 | 0.18 | 1 | 23.1%
Barometer | Atmospheric pressure in inches of mercury ("Hg) | 21536 | 30.04 | 0.12 | 29.74 | 30.39 | 0.42%
Rain | Binary (raining or not) | 21536 | 0.41 | 1.14 | 0 | 1 | 278.4%
PM2.5 | PM2.5 concentration in µg/m3 | 21536 | 22.85 | 137.64 | -1 | 17188.39 | 601.6%

The coefficient of variation (CV) analysis reveals that the ‘Barometer’ feature has low 

variability, while Temp, Wind speed, and Humidity show moderate variability. Finally, 

PM2.5 has high variability, indicating substantial fluctuations. 
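As a quick check of the CV column, the sketch below recomputes the coefficient of variation as the standard deviation divided by the mean, expressed as a percentage. The function name and the dictionary layout are illustrative assumptions; the mean and std values are copied from Table 3, so the recomputed percentages can differ slightly from the table because those inputs are already rounded.

```python
def coefficient_of_variation(mean: float, std: float) -> float:
    """Return the coefficient of variation (std / mean) as a percentage."""
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return abs(std / mean) * 100.0


# (mean, std) pairs copied from Table 3; the dictionary layout is illustrative.
table3 = {
    "Temp": (54.81, 6.291),
    "Wind speed": (7.44, 4.90),
    "Barometer": (30.04, 0.12),
    "PM2.5": (22.85, 137.64),
}

cv = {name: coefficient_of_variation(m, s) for name, (m, s) in table3.items()}

for name, value in cv.items():
    print(f"{name}: {value:.2f}%")
```

The ordering of the recomputed values (Barometer lowest, PM2.5 highest) matches the interpretation given in the text.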

For the Nablus dataset, we had to deal with some extreme outliers that do not represent 

real pollution levels but rather measurement errors. For example, at the “Unit_F_Hijjawi” 

station, the maximum PM2.5 value was 17188.39 µg/m3, which is highly improbable. In 

addition, the minimum value at the “118_NNUH” station was -1, which is not possible 

because PM2.5 concentrations cannot be negative.  

3.1.2 Data Preprocessing 

The first step in preprocessing the data is examining the number of missing values for 

each feature, and filling them with an imputation method. For the Beijing dataset, the 



missing values for meteorological features were less than 0.1%. However, for pollutant 

features, the missing values were mostly around 2%, except for CO, which had 4.92%, 

and O3, which had 3.16%. For the Nablus dataset, the only two features with missing 

values were wind speed, with 0.04%, and visibility, which had a much higher percentage 

of missing values at 44.76%. 

Since this is time-series data, simple imputation methods such as the mean and median may 

not be suitable, because a value at one time-step is related to the values around it more than 

to values at distant time-steps. For the numerical features, an Iterative Imputation method 

was used, which is a multivariate imputation algorithm that estimates missing values 

iteratively. It treats each feature column with missing values as a target variable and 

uses the other columns as predictors. At each iteration, the Iterative Imputer uses a 

regression model to predict the missing values. For categorical features such as wind 

direction, a k-nearest neighbors (KNN) imputer was used. It imputes missing values based 

on the values of neighboring data points, and it can handle categorical data. 

In the Beijing dataset, we encoded the wind direction by converting the textual description 

of the direction into angular degrees, so we ended up with 16 different degrees to describe 

the wind direction. Figure (4) shows the wind direction mapping from textual categories 

to degrees. 

Figure 4 

Wind Direction Degree Mapping Chart. 
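The mapping from 16 textual wind directions to degrees can be sketched as a simple lookup table. This is an illustration only: it assumes the usual compass convention (N = 0 degrees, increasing clockwise in 22.5-degree steps) and the standard abbreviations, which may differ from the exact labels and values shown in Figure (4). The names `COMPASS_POINTS` and `encode_wind_direction` are hypothetical.

```python
# Assumed convention: N = 0 degrees, angles increase clockwise in 22.5-degree steps.
COMPASS_POINTS = [
    "N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
    "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW",
]
STEP_DEGREES = 360.0 / len(COMPASS_POINTS)  # 22.5 degrees per compass point

WIND_DIRECTION_DEGREES = {
    name: index * STEP_DEGREES for index, name in enumerate(COMPASS_POINTS)
}


def encode_wind_direction(label: str) -> float:
    """Convert a textual wind direction such as 'SSW' into degrees."""
    return WIND_DIRECTION_DEGREES[label.strip().upper()]


print(encode_wind_direction("E"))
print(encode_wind_direction("ssw"))
```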

 

The PM2.5 measurements in the Nablus dataset had a temporal granularity of 1 minute, while 

the meteorological measurements had a temporal granularity of 30 minutes. We aggregated 

the PM2.5 measurements into 30-minute intervals to match the meteorological data. 
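The aggregation step can be illustrated with a small standard-library sketch. It assumes that each 30-minute window is summarized by the arithmetic mean of its 1-minute readings and is labelled by its start time; the text does not state which aggregation function was used, so the mean is an assumption here, and the timestamps and values are made up for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def aggregate_to_30_min(readings):
    """Average (timestamp, value) readings within each 30-minute window.

    Each window is keyed by its start time. Using the mean is an assumption.
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        window_start = ts.replace(
            minute=(ts.minute // 30) * 30, second=0, microsecond=0
        )
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Hypothetical 1-minute readings covering one hour (values 0, 1, ..., 59).
start = datetime(2024, 1, 1, 10, 0)
readings = [(start + timedelta(minutes=i), float(i)) for i in range(60)]
aggregated = aggregate_to_30_min(readings)

for window, mean_value in aggregated.items():
    print(window, mean_value)
```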

As stated in the section [Data Analysis], the Nablus dataset had some outliers due to 

measurement error, so we had to fix them. The negative PM2.5 value of (-1) was changed 

to zero. Moreover, the extremely high value was treated as missing and filled during the 

imputation process. Regarding the ‘Visibility’ feature (visibility distance), it was removed 

due to approximately 45% of the values being missing, and the majority of the non-

missing entries having a value of 10 miles, offering little variability. 

We applied min-max normalization to the data, to ensure a uniform scaling of features 

and improve model performance, especially when the features have different ranges. Min-

max normalization transforms each feature to a common scale by mapping its values to a 

range between 0 and 1 using this formula:  

$$x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{3.1}$$

where min(x) and max(x) are the minimum and maximum values of the feature, 

respectively. Other normalization methods were explored during the initial experiments, 

and min-max normalization yielded the best results in terms of evaluation metrics. 
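Equation (3.1) can be illustrated with a few lines of plain Python. The function name and the sample values are illustrative, and the handling of a constant feature (returning zeros to avoid division by zero) is an assumption that the text does not discuss.

```python
def min_max_normalize(values):
    """Map a list of numbers to the range [0, 1] following Equation (3.1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative temperatures; 37 and 81 are the min and max reported in Table 3.
temps = [37.0, 54.0, 81.0]
normalized = min_max_normalize(temps)
print(normalized)
```

When the data is split into training and test sets, the minimum and maximum are usually taken from the training portion only, so that no information from the test period leaks into the scaling.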

3.1.3 Features Engineering 

In the previous section, we addressed the data pre-processing techniques that have been 

used like filling the missing values, normalization and encoding of categorical data. In 

this section, we will present the feature engineering techniques we used in both datasets. 

Feature engineering involves creating new features or modifying existing ones to improve 

the performance of machine learning models. 

Starting with the Beijing dataset, we engineered edge features for each pair of nodes to 

be used in DyGAT, alongside the node features. This approach provides an additional 

perspective on the dynamic structure of the graph. It incorporates various edge features, 

including directional ones that influence the dispersion of PM2.5. These edge features 

encapsulate the spatiotemporal relationships between monitoring sites by leveraging the 

geographical coordinates of the nodes and the wind features at both the source and target 



nodes. First, we calculated the geographical distance between each pair of monitoring 

stations utilizing the Haversine formula to take the curvature of the Earth's surface into 

consideration for more precise distance calculations. Then, we added the wind speeds at both 

the source and target nodes, and we included the difference in wind direction between 

the source and target nodes. 

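 Here is the Haversine formula: 

$$a = \sin^2\left(\frac{\Delta\phi}{2}\right) + \cos(\phi_1)\,\cos(\phi_2)\,\sin^2\left(\frac{\Delta\lambda}{2}\right) \tag{3.2}$$

$$c = 2\,\operatorname{atan2}\left(\sqrt{a}, \sqrt{1 - a}\right) \tag{3.3}$$

$$d = R\,c \tag{3.4}$$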

where: 

 ϕ1 and ϕ2 are the latitudes of the two points in radians. 

 ∆ϕ is the difference in latitudes (ϕ1 − ϕ2). 

 ∆λ is the difference in longitudes (λ1 − λ2). 

 R is the Earth's radius (mean radius = 6,371 km). 

 d is the distance between the two points. 
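A minimal Python sketch of Equations (3.2) to (3.4) is given below. It assumes the inputs are given in decimal degrees and converts them to radians internally; the function name and the test coordinates are illustrative and are not taken from either dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, as stated in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi1 - phi2
    d_lambda = math.radians(lon1 - lon2)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )  # Equation (3.2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))  # Equation (3.3)
    return EARTH_RADIUS_KM * c  # Equation (3.4)


# One degree of longitude along the equator is roughly 111.19 km.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 2))
```

Because the squared sine terms are even functions, the result does not depend on which of the two points is treated as the source.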

We wanted to capture the influence of wind direction on pollutant dispersion. To do this, 

we used the geo-locations of each pair of nodes to calculate the initial bearing from the 

source node to the target node. Then, we compared the wind direction at the source node 

with this bearing to determine whether the wind from the source node was directed towards 

the target node. If the wind direction aligns with the calculated bearing within a defined 

threshold of 45 degrees, it suggests that the wind is blowing towards the target location. 

The following equations are used to determine whether the wind from the source node 

blows toward the target node: 

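$$y = \sin(\Delta\lambda)\,\cos(\phi_2) \tag{3.5}$$

$$x = \cos(\phi_1)\,\sin(\phi_2) + \sin(\phi_1)\,\cos(\phi_2)\,\cos(\Delta\lambda) \tag{3.6}$$

$$\theta = \operatorname{atan2}(y, x) \tag{3.7}$$

$$\text{Initial Bearing} = \left(\theta \times \frac{180}{\pi} + 360\right) \bmod 360 \tag{3.8}$$

$$\Delta_{wind} = \left|\text{Wind Direction (source)} - \text{Initial Bearing}\right| \tag{3.9}$$

Then adjusting Δwind for the circular nature of wind directions: 

$$\Delta_{wind} = \begin{cases} \Delta_{wind} & \text{if } \Delta_{wind} \le 180 \\ 360 - \Delta_{wind} & \text{if } \Delta_{wind} > 180 \end{cases} \tag{3.10}$$

$$\text{Wind Blows Towards Target} = \begin{cases} 1 & \text{if } \Delta_{wind} \le 45 \\ 0 & \text{if } \Delta_{wind} > 45 \end{cases} \tag{3.11}$$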

where: 

 ϕ1 and ϕ2 are the latitudes of the two points in radians. 

 ∆ϕ is the difference in latitudes (ϕ1 − ϕ2). 

 ∆λ is the difference in longitudes (λ1 − λ2). 

 θ is the initial bearing angle in radians. 
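The bearing and threshold test can be sketched in Python as follows. Two assumptions should be checked against the actual implementation. First, the sketch uses the widely published forward-azimuth form, in which ∆λ is taken as the target longitude minus the source longitude and the second term of x is subtracted; this sign convention differs from the way Equation (3.6) and the definition of ∆λ are printed above. Second, the wind direction is treated as the direction in which the air is moving, whereas meteorological reports often give the direction the wind comes from. All function names are illustrative.

```python
import math


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north.

    Assumed convention: standard forward-azimuth formula with
    delta_lambda = lon2 - lon1 and a subtracted second term in x.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_lambda = math.radians(lon2 - lon1)
    y = math.sin(d_lambda) * math.cos(phi2)
    x = (
        math.cos(phi1) * math.sin(phi2)
        - math.sin(phi1) * math.cos(phi2) * math.cos(d_lambda)
    )
    theta = math.atan2(y, x)
    return (math.degrees(theta) + 360.0) % 360.0  # Equation (3.8)


def angular_difference_deg(a, b):
    """Smallest absolute difference between two angles, Equations (3.9)-(3.10)."""
    diff = abs(a - b) % 360.0
    return diff if diff <= 180.0 else 360.0 - diff


def wind_blows_towards_target(wind_dir_deg, bearing_deg, threshold_deg=45.0):
    """Binary edge feature of Equation (3.11)."""
    return 1 if angular_difference_deg(wind_dir_deg, bearing_deg) <= threshold_deg else 0


bearing = initial_bearing_deg(0.0, 0.0, 0.0, 1.0)  # target lies due east
print(bearing, wind_blows_towards_target(100.0, bearing))
```

The wrap-around handling matters near north: a wind direction of 350 degrees and a bearing of 10 degrees differ by 20 degrees, not 340.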

In summary, we have five features for each edge, and they are: 

1- Distance: The Haversine distance between the two nodes. 

2- Wind direction difference: The difference in wind directions, measured in degrees, 

between the two nodes. This value provides additional context about the dynamic 

spatial relationship between a pair of nodes. It is a scalar value, representing the 

angular difference between the wind directions at the two locations. 

3- Wind blows towards target: It is a binary feature, where 1 means the wind from the source 

node is directed towards the target node and 0 means it is not. 

4- Wind speed at source node. 

5- Wind speed at target node. 

 As for the Nablus dataset, the weather features were the same for all stations because 

they are very close to each other and located within a relatively small city, so the distance 

between each pair of nodes was the only edge feature, which is a static feature unlike the 

temporal nature of the edge features in the Beijing dataset.  

After processing the Beijing dataset, we turned our attention to the Nablus dataset, which 

contains a different set of node features. One of the features in the Nablus dataset is 

‘Rain,’ which is a binary value indicating whether it is raining or not. In addition, there is 



another feature called ‘Weather’, which contains a textual description of the weather like 

the sky condition (e.g. clear, cloudy, scattered clouds, etc.) and rain intensity. Therefore, 

we extracted the rain intensity description from the ‘Weather’ feature and added it to 

the ‘Rain’ feature. Both features were then encoded using an ordinal encoder. The ‘Rain’ 

feature had 6 ordinal categories ranging from ‘No rain’ to ‘Heavy rain’. The ‘Weather’ 

feature, which now contained only a categorical description of the sky, was converted to an 

ordinal description with 7 categories representing cloud coverage. 

We made two versions of the Nablus dataset: one version with PM2.5 values and 

meteorological features, and another with added features from the study by Saleh et al. [55]. 

That study used three main categories of potential air pollution sources to create the PM 

hazard map for Nablus: factories, quarries and traffic. Additionally, other factors 

influencing PM distribution, such as altitude, wind speed and wind direction, were also 

considered. Then, they calculated the distance and direction, in degrees, from 

each pollution source to the measurement stations. The distance and direction of pollution 

hazard sources were added to the second version of the data, to test the influence of the 

added context about nearby pollution sources on the model’s performance. 

3.2 Model Architecture   

In this section, we present our hybrid spatiotemporal forecasting model, the DyGAT-

Informer, which combines two components: the novel Dynamic GAT (DyGAT) model 

that we have designed to capture dynamic spatial dependencies between measuring 

stations at different time steps, and the Informer Transformer, which captures the 

temporal dependencies at each station. The Informer [71] was chosen due to its ability to 

model complex and long-range temporal information.  

Before describing our primary hybrid model, it should be noted that for the case study 

dataset (the Nablus dataset), we combined DyGAT with a Seq2Seq LSTM to create 

DyGAT-LSTM. This model initially served as one of the baseline models for comparing 

DyGAT-Informer, trained with the Beijing dataset, against models that used other types of 

temporal components. In the case study, DyGAT-LSTM was used as the main model to 

accommodate the small size of the Nablus dataset, which was insufficient to train the 

Informer, and DyGAT-Informer was included among the baseline models instead. We made 

this adjustment because the main goal of using the case study data, despite its small size, 

was to test the second hypothesis of the study: that including contextual information about 

nearby pollution sources for each station would improve forecasting accuracy. The 

DyGAT-LSTM model is described in detail in section [Baseline Models] in chapter 4 

[Experimental Design and Setup]. 

When we started building our primary hybrid model, many candidate time-series 

Transformers were considered, including Informer, Autoformer and FEDformer. Each of 

these Transformers is a powerful forecasting model, but each has its own strengths and 

weaknesses depending on the task and the nature of the dataset. During the preliminary 

experimentation and hyper-parameter tuning phase, we found that the Informer was the 

most suitable model for our dataset. We included the Autoformer as a part of the final 

baseline models to compare the two Transformers’ performances. We decided to exclude 

the FEDformer model after the preliminary experiments, as its results were similar to 

those of the Autoformer. However, it had a significantly higher computation time, with 

each epoch averaging 9 minutes and 30 seconds, compared to just 13 seconds for the 

Informer and 16 seconds for the Autoformer. 

When it comes to modeling spatial dependencies, we hypothesized that the relation 

between each pair of nodes is not static and is not determined only by the geo-location of 

the nodes, but is rather a dynamic relation that changes with the nonlinear processes of 

air dynamics. To model the dynamic spatial relations between the nodes, we chose GAT 

because it allows us to include different attention mechanisms that assign varying weights 

to the nodes and edges in the graph.   

As mentioned in the feature engineering section, we created edge features to help the 

GAT model capture the dynamic spatial relation between each pair of nodes. The 

edges have a direction, meaning that the edge features between a pair of nodes are different 

depending on which node is the source and which one is the target. This difference 

arises because one of the features determines whether the 

wind from the source node is directed towards the target node or not. The edge features 

also contain information about the wind speed and the difference in wind direction between 

nodes, which are time-varying features. Wind direction and speed affect PM2.5 

dispersion in the atmosphere, which is why we chose them as edge features to emphasize 

their role in forecasting future PM2.5 concentrations.  



Figure (5) presents a general overview of the DyGAT-Informer model. The ‘Graph 

Module’ represents our DyGAT model and the ‘Transformer Module’ represents the 

Informer. There are multiple ways to configure a hybrid model. Each configuration has 

its own set of advantages and disadvantages, depending on the specific goals and 

requirements of the forecasting task. We experimented with both sequential and parallel 

configurations to integrate the DyGAT and Informer components of the hybrid model. 

We found that the optimal configuration involved sequential fusion, where the DyGAT 

processes the data first, and its output is then fed into the Informer. Additionally, we 

examined whether to feed the DyGAT output directly into the Informer or concatenate it 

with the original node embeddings, and we found that using concatenation produced 

better results. In the figure, the concatenation process is depicted by the ‘CAT’ block. 

Figure 5 

Overall Architecture of the DyGAT-Informer Model.  

 
Two historical sequences are used as input to the first stage of the model: the node 

features and the edge features. Let p be the number of past time-steps considered in the 

historical input sequence, and let Xn and Xe denote the node features and the edge 

features, respectively. The input sequence for the node features is 

$X_n = \{X_n^{t-p+1}, X_n^{t-p+2}, \dots, X_n^{t}\}$, and the input sequence for the edge features is 

$X_e = \{X_e^{t-p+1}, X_e^{t-p+2}, \dots, X_e^{t}\}$. Each of these inputs is combined with the temporal 

features of that sequence (minute, hour, day, weekday and year), and used to create the 

initial node and edge embeddings, which utilize temporal positional encoding to help the 

Informer understand the positions of the tokens in the sequence. In Figure (5), Xn is the 

‘Encoder Input’ and the node embeddings are the ‘Encoder Embeddings’.   

Regarding the ‘Decoder Input’, during training, the input Yn is the future (target) 

sequence the model needs to forecast, where $Y_n = \{X_n^{t+1}, X_n^{t+2}, \dots, X_n^{t+H}\}$, with H 

representing the forecasting horizon. This input is combined with temporal features and 

used to generate the ‘Decoder Embedding’. For inference, the decoder relies on 

previously predicted values, while still incorporating temporal embeddings to preserve 

temporal context. The decoder employs attention mechanisms that include masking, which 

ensures that it attends to relevant parts of the sequence and prevents future information from 

being accessed during training. In addition to the ‘Decoder Embeddings’, the Informer 

decoder also receives the encoded representations from the Informer encoder as another 

input. 

In our model, we converted the input time-series features into embeddings that contain 

two parts: 

1. Token Embeddings: In this type of embedding, the raw input data is converted into 

a dense, continuous vector representation. Token embeddings are used to represent 

individual data points in a high-dimensional space. This dense vector is a form of 

representation that the Transformer architecture can effectively process. 

2. Temporal Embeddings: These represent temporal features (hour, day, month, and 

year) that can help the model understand periodic patterns and time-based 

dependencies. They are added to the token embeddings. 

The Informer Transformer code [72] provides three types of temporal embeddings as a 

hyper-parameter: ‘timeF’, ‘fixed’, and ‘learned’. They are used to embed temporal 

features of the data (like hour, day, week, month, etc.) into a high-dimensional space. 

Here's what each type represents: 

1. Fixed Embedding: It creates non-trainable embeddings based on trigonometric 

functions (sine and cosine). These embeddings are dependent on the position of time 

steps and are generated using a predefined formula without any learnable parameters. 

This method captures the periodicity of time, which helps in modeling temporal 

patterns in the data.  



2. Learned Embedding: The temporal information is embedded into a vector that can be 

learned during training. In this method, the temporal features (e.g., hour, day, month) 

are passed through an embedding layer, and the model learns the best way to represent 

these features during the training process. 

3. Time Feature (timeF) Embedding: It takes raw time-related features and applies a 

linear transformation to embed these features into a higher-dimensional space. Instead 

of using predefined or learned embeddings, the raw time features (like "hour", "day 

of the week", etc.) are directly encoded by feeding them into a linear layer. This 

approach does not learn a specific embedding for each temporal position but linearly 

transforms the temporal features into a vector.  

Our model uses 'fixed' temporal embeddings, as they yielded better results during hyper-

parameter tuning.  
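To make the idea of a fixed (non-trainable) embedding concrete, the sketch below builds a sinusoidal lookup table in plain Python. It follows the standard Transformer sinusoidal formula with a base of 10000, which is assumed here to match the ‘fixed’ option of the referenced code; the function name, the table size and the embedding dimension are illustrative.

```python
import math


def fixed_embedding_table(num_positions: int, d_model: int):
    """Build a non-trainable sinusoidal table of shape [num_positions, d_model].

    Even columns hold sines and odd columns hold cosines of the same frequency.
    """
    table = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(0, d_model, 2):
            freq = math.exp(-math.log(10000.0) * i / d_model)
            table[pos][i] = math.sin(pos * freq)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(pos * freq)
    return table


# Illustrative use: one row per hour of the day, embedded in 8 dimensions.
hour_table = fixed_embedding_table(num_positions=24, d_model=8)
print(hour_table[1][:2])
```

In such a scheme, a temporal feature value (for example, hour = 13) simply selects a row of the table, and the rows never change during training.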

3.2.1 DyGAT 

We reviewed different GAT models that have been used in spatiotemporal forecasting 

tasks, and we found a spatiotemporal traffic-forecasting model called STGAT [73]. Their 

implementation of the “Graph Attention Layer” in the model was the same as in the 

original GAT [38]. Their “Graph Attention Layer” has a self-attention mechanism and a 

learnable adjacency matrix, which they called a “self-adaptive adjacency matrix”. It is a 

parameterized module with learnable parameters. An element in the adjacency matrix 

adj(i, j) specifies the weight between node i and node j. During training, the adjacency 

matrix is adjusted iteratively to adapt to the data.  

We used their Graph Attention Layer as the base for our DyGAT model. 

As for the adjacency matrix, we created our own “Attention-based Dynamic Adjacency 

Matrix”, which uses attention mechanisms to create a weighted adjacency matrix for the 

graph. We kept the original self-adaptive adjacency matrix as a hyper-parameter option. During 

the hyper-parameter tuning process, the model showed better performance when our 

attention-based adjacency matrix was used. 

The DyGAT model consists of a stack of dynamic GAT layers and an attention-based 

dynamic adjacency matrix. Each DyGAT layer has a self-attention mechanism called 

‘node attention’ that captures the dynamic spatial relations between nodes based on their 

own features using the attention-based dynamic adjacency matrix. Multi-head attention 

was utilized in the node attention computations. The layer also computes ‘edge attention’ 

weights from the dynamic edge features; these attention weights are used to enhance the 

representation of the spatial dependencies between the nodes. Figure (6.a) shows a 

general view of the DyGAT model, and Figure (6.b) shows the structure of the dynamic 

GAT layer.   

Figure 6 

DyGAT Architecture. (a) Overall Architecture of the DyGAT. (b) Detailed Architecture of a 

Dynamic GAT Layer. 

 

There are two inputs to the DyGAT model: ‘Encoder Embeddings’ and ‘Edge 

Embeddings’. The ‘Encoder Embeddings’ tensor is of shape [batch_size, n_nodes, 

n_time_steps, feature_size], where:  

 batch_size refers to the number of samples processed in a single training or inference 

step. 

 n_nodes corresponds to the number of nodes in the graph. 

 n_time_steps represents the number of time steps in the input sequence. 

 feature_size represents the dimensionality of the feature vector, capturing both node 

attributes and temporal features transformed into an embedding space. 

The ‘Edge Embeddings’ tensor is of shape [batch_size, n_nodes, n_nodes, n_time_steps, 

feature_size], where the two n_nodes dimensions indicate source and target nodes, 

encoding their relationships over time.   

At each time step, the Encoder Embeddings are extracted and represented as a tensor of 

shape [batch_size, n_nodes, feature_size]. This tensor is first used as input to the Dynamic 

Attention-based Adjacency Matrix, which computes a weighted adjacency matrix for that 

specific time step. The resulting weighted adjacency matrix, along with the Encoder 

Embeddings for that time step, serves as input to the first DyGAT layer. Similarly, the 

Edge Embeddings at the same time step are extracted as a tensor of shape [batch_size, 

n_nodes, n_nodes, feature_size] and are also used as input to the first DyGAT layer. The 

following sub-sections provide a detailed description for the DyGAT components. 

3.2.1.1 Attention-based Dynamic Adjacency Matrix 

The purpose of using attention to compute the adjacency matrix is to dynamically adjust 

the graph structure based on the input features at each time-step. This dynamic adjustment 

allows the model to capture the changing relationships between nodes in the graph over 

time. The dynamic adjacency matrix applies linear transformations to the node features 

to obtain query and key representations, computes attention scores using the dot-product 

attention mechanism, and applies softmax to obtain a normalized adjacency 

matrix.  



The DyGAT uses one time-step at a time to compute the dynamic adjacency matrix. The 

input node features Xn at time-step t are of shape [batch_size, n_nodes, 

nodes_feature_size]. First, the Query (Q) and Key (K) are computed as follows: 

$$Q = X W_Q \tag{3.12}$$

$$K = X W_K \tag{3.13}$$

where $W_Q$ and $W_K$ are learnable weight matrices. Then we compute the attention 

weights (A) using the softmax function: 

$$A = \operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) \tag{3.14}$$

where $d_k$ is the dimension of the key vectors. Now we construct the adjacency matrix: 

$$Adj = A \times A^{T} \tag{3.15}$$
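The computation in Equations (3.12) to (3.15) can be traced with a small pure-Python sketch for a single sample and a single time-step (no batch dimension). The helper functions, the three-node input and the weight values are all illustrative assumptions; in the model, the weight matrices are learned and the operations run on batched tensors.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def dynamic_adjacency(x, w_q, w_k):
    """Return (A, Adj) for node features x of shape [n_nodes, feature_size]."""
    q = matmul(x, w_q)  # Equation (3.12)
    k = matmul(x, w_k)  # Equation (3.13)
    d_k = len(k[0])
    scores = [[v / math.sqrt(d_k) for v in row] for row in matmul(q, transpose(k))]
    a = softmax_rows(scores)  # Equation (3.14)
    return a, matmul(a, transpose(a))  # Equation (3.15)


# Hypothetical node features (3 nodes, 2 features) and projection weights.
x = [[0.2, 0.1], [0.4, 0.3], [0.9, 0.5]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[0.5, -0.5], [0.5, 0.5]]
attention, adjacency = dynamic_adjacency(x, w_q, w_k)
print(adjacency)
```

Each row of A sums to one, and multiplying A by its transpose yields a symmetric matrix, so the resulting adjacency weight between nodes i and j is the same in both directions.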

3.2.1.2 Dynamic GAT Layer 

This is the core component of the DyGAT module, where two attention 

mechanisms are used to capture the spatial dependencies between nodes at each 

time-step. The first attention mechanism, ‘node attention’, is a graph self-attention mechanism 

similar to the one used in the original GAT and in STGAT [73]. The difference here is 

that STGAT used an adjacency matrix that is initialized as a learnable parameter, while we 

utilized an attention mechanism to create a truly dynamic adjacency matrix that changes with 

each time-step.  

After the dynamic adjacency matrix is created, node attention uses it as the basis for 

computing attention scores between nodes. The adjacency matrix influences which nodes 

are considered neighbors and how much weight each neighbor gets when aggregating 

features. To compute the node attention weights, Q and K are first computed using 

equations (3.12) and (3.13), then the attention scores are computed using the leaky ReLU 

function: 

$$e_{ij} = \operatorname{LeakyReLU}(Q_i \cdot K_j) \tag{3.16}$$

where $e_{ij}$ represents the attention score between node i and node j, and leaky ReLU is 

defined as follows: 

$$\operatorname{LeakyReLU}(x) = \begin{cases} 0.01x & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases} \tag{3.17}$$

Then, the attention scores are used to compute the weighted sum of neighboring node 

representations: 

$$\text{nodes\_attention} = \sum_{j} \operatorname{softmax}(e_{ij})\, X_j \tag{3.18}$$

This process is repeated across different attention heads. The outputs of the multi-head 

attention are then either concatenated or averaged; we made this choice a 

hyper-parameter. 
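A single-head version of Equations (3.16) to (3.18) is sketched below in plain Python. To keep the example short, the queries and keys are passed in directly rather than produced by learned projections, the adjacency weighting is omitted, and the softmax is taken over all nodes j for each node i; these simplifications, the function names and the sample values are assumptions made for illustration.

```python
import math


def leaky_relu(x, slope=0.01):
    """Equation (3.17): identity for x >= 0, small linear slope otherwise."""
    return x if x >= 0 else slope * x


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def node_attention(x, q, k):
    """Single-head node attention: row i is sum_j softmax_j(e_ij) * x_j."""
    n_features = len(x[0])
    out = []
    for q_i in q:
        scores = [leaky_relu(dot(q_i, k_j)) for k_j in k]  # Equation (3.16)
        weights = softmax(scores)
        out.append(
            [sum(w * x_j[f] for w, x_j in zip(weights, x)) for f in range(n_features)]
        )  # Equation (3.18)
    return out


# Hypothetical features for 3 nodes; queries and keys reuse them unprojected.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = node_attention(x, q=x, k=x)
print(attended)
```

With several heads, this computation would be repeated with different projections, and the per-head outputs would then be concatenated or averaged as described above.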

The second attention mechanism, ‘edge attention’, is applied to the edge features 

we created during the feature engineering phase. We reviewed different attention 

mechanisms and found that additive attention [74] was the most suitable due to the 

nature of the directional edge features. These directional features are specific to each pair 

of nodes and require a more focused attention mechanism that can address these pairwise 

dependencies.   

The edge features input Xe at time-step t is of shape [batch_size, n_nodes, n_nodes, 

edge_feature_size]. First, Q and K are computed using equations (3.12) and (3.13), and 

then Q and K are combined using an additive operation and passed through a 

hyperbolic tangent activation function (tanh): 

$$Z = \tanh(Q + K) \tag{3.19}$$

After that, Z is passed through another linear transformation to obtain the attention scores. 

This transformation is performed by a linear layer (called ‘Linear’ in PyTorch), which is 

defined as: 

$$y = xA^{T} + b \tag{3.20}$$

where x is the input tensor, A is a learnable weight matrix, and b is the bias. After passing Z 

through the linear layer, we get the attention scores, which are then passed through a softmax 

function to get the edge attention weights. 
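The additive edge attention described above can be sketched for a single sample and time-step as follows. The sketch assumes that the softmax is taken over the target nodes j of each source node i, and that the query and key projections have no bias; neither detail is stated in the text. The helper names, the two-node example and all weight values are illustrative.

```python
import math


def linear(vec, weight, bias):
    """y = x A^T + b for one input vector; `weight` has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def edge_attention(edge_feats, w_q, w_k, w_score, b_score):
    """Additive attention weights for edge features [n_nodes, n_nodes, feat]."""
    n = len(edge_feats)
    zero_bias = [0.0] * len(w_q)
    weights = []
    for i in range(n):
        scores = []
        for j in range(n):
            q = linear(edge_feats[i][j], w_q, zero_bias)
            k = linear(edge_feats[i][j], w_k, zero_bias)
            z = [math.tanh(a + b) for a, b in zip(q, k)]  # Equation (3.19)
            scores.append(linear(z, [w_score], [b_score])[0])  # Equation (3.20)
        weights.append(softmax(scores))  # assumed: normalize over targets j
    return weights


# Hypothetical directed edge features for 2 nodes (2 features per edge).
edge_feats = [
    [[0.0, 0.0], [1.0, 0.5]],
    [[0.2, 1.0], [0.0, 0.0]],
]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[0.5, 0.5], [0.5, -0.5]]
edge_weights = edge_attention(edge_feats, w_q, w_k, w_score=[1.0, 1.0], b_score=0.0)
print(edge_weights)
```

Because the edge features are directional, the weight assigned to the edge from node 0 to node 1 generally differs from the weight assigned to the edge from node 1 to node 0.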

The edge attention weights are then used to modulate the node features to get another 

perspective of the spatial dependencies, this time based on the edge features. The 

output is then added to the multi-head attention outputs to be concatenated or averaged. By 

combining node self-attention to model how each node influences all other nodes, and 

edge additive attention to capture the weight of interactions between each pair of nodes 

based on directional features, DyGAT can create a richer spatiotemporal representation. 

3.2.1.3 Final DyGAT Output 

Each Dynamic GAT Layer returns a modified representation of the input that includes the 

spatial dependency for multiple time-steps. The final vector represents a spatiotemporal 

representation of the entire input sequence, which will be either the final output of 

DyGAT or the input to the next Dynamic GAT Layer, if multiple layers are used. 

3.2.2 Informer 

The Informer [46] is a time-series Transformer variant introduced by Zhou et al. to solve 

some of the issues that arise when Transformers are used for long-range time-series 

forecasting. The Informer presented three major modifications to the vanilla Transformer. 

The first modification was the ProbSparse self-attention mechanism, which reduces the 

Transformer’s quadratic time complexity to O(L log L), where L is the sequence length, 

by selectively attending to the most relevant parts of the data.  

The second major modification was creating a self-attention distilling technique to 

efficiently handle extremely long sequences. Finally, the model employs a generative 

style decoder that predicts long time-series sequences in a single forward operation to 

enhance the speed of inference for long-sequence predictions significantly.  

The Informer model is composed of an encoder and a decoder, both of which utilize a 

combination of multi-head self-attention and ProbSparse self-attention layers. The 

encoder processes the input sequence, while the decoder generates the predicted output 

sequence. Here is a more detailed description of the Informer architecture: 



1. Encoder: It processes a sequence of input features to effectively capture the temporal 

dependencies. It employs multiple layers of multi-head ProbSparse self-attention. It 

utilizes self-attention distilling to address redundancy in the encoder's feature maps, 

by prioritizing features with dominant information. It reduces the time complexity 

significantly through a max-pooling operation and convolutional layers. The number 

of self-attention distilling layers decreases progressively in each layer, forming a 

pyramid structure. 

2. Decoder: It generates long sequential outputs efficiently. It uses the standard 

Transformer decoder with layers of multi-head self-attention. It utilizes “Masked 

Multi-head Self-Attention” to prevent the decoder from attending to future tokens 

during training to maintain autoregressive property. The decoder incorporates 

generative inference, where it generates the entire output sequence in a single forward 

pass to accelerate the decoding process. 

The output of our DyGAT model is concatenated with the original input embeddings to 

provide stability during training, and to provide the Informer with a version of the original 

temporal data before the dynamic spatial dependency representations are added. The user 

can choose to use the DyGAT output directly as input to the Informer, or to concatenate 

it with the input embeddings. We made this concatenation an option in the 

hyper-parameters. However, as mentioned before, during hyper-parameter tuning we found 

that concatenation produced better performance, so it is the default option. The overall 

architecture of the Informer is presented in Figure (A.1) in Appendix A. 

 

Chapter Four 

Experimental Design and Setup 

This chapter outlines the experimental design and setup employed to evaluate the 

performance of the proposed hybrid spatiotemporal PM2.5 forecasting model. We begin 

by introducing the baseline models against which our model’s performance will be 

evaluated. We provide an overview of each model's architecture and the rationale for 

choosing it. Following this, the experimental setup is described, including hardware 

and software specifications, model training and testing parameters, the hyper-parameter 

tuning process and the evaluation metrics that were used. 

4.1 Baseline Models 

The first phase of the study aimed to test the first hypothesis, which stated that using a 

hybrid model addressing both spatial and temporal dependencies in air quality data would 

result in more accurate PM2.5 forecasts at individual stations compared to using only a 

time-series forecasting model. To evaluate this, we employed our primary model, 

DyGAT-Informer. We used the Beijing dataset to train both our DyGAT-Informer model 

and an Informer model without DyGAT to verify whether capturing spatial dependencies 

between the measuring stations would improve forecasting results. 

 In addition, we combined our DyGAT model with other time-series forecasting 

models to evaluate the performance of the temporal component of the hybrid model. 

These models are a Seq2Seq LSTM model and the Autoformer. We also compared 

our model with STGAT [73] because we used its version of GAT as the foundation of 

our model. We believed that this comparison would provide insights into the effectiveness 

of the modifications in our DyGAT variant. Finally, we chose a PyTorch implementation 

[75] of a model [76] that uses GCN as its spatial dependency computation module.  

The following is a detailed description of the architectures for the four baseline models: 

1- DyGAT-Autoformer 

This hybrid model combines our DyGAT with the Autoformer [67], a time-series

forecasting Transformer that uses a signal decomposition technique to break down

the time-series data into components representing the trend, seasonal, and residual parts of the



time-series. Signal decomposition is intended to help the model understand the

underlying structure of the data. We selected this Transformer because of the high

performance it has demonstrated in the literature. The model uses an Auto-Correlation

mechanism instead of the self-attention mechanism used in standard Transformers. This

mechanism identifies periodic patterns within the data and the relationships between

them, which enables the model to capture long-range dependencies.
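The two ideas in this description, signal decomposition and the search for periodic patterns through autocorrelation, can be sketched in plain Python. This is a simplified illustration, not the Autoformer implementation: it assumes a moving-average trend with the remainder treated as one seasonal part (no separate residual), and all function names are hypothetical.

```python
from typing import List, Tuple


def moving_average(series: List[float], window: int) -> List[float]:
    """Moving average that keeps the series length by repeating edge values."""
    left = (window - 1) // 2
    right = window - 1 - left
    padded = [series[0]] * left + list(series) + [series[-1]] * right
    return [sum(padded[i:i + window]) / window for i in range(len(series))]


def decompose(series: List[float], window: int) -> Tuple[List[float], List[float]]:
    """Split a series into a smooth trend and the remainder (seasonal part)."""
    trend = moving_average(series, window)
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal


def autocorrelation(series: List[float], lag: int) -> float:
    """Normalized autocorrelation of the series with itself shifted by `lag`."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return 0.0
    return sum(centered[t] * centered[t - lag] for t in range(lag, n)) / denom


def dominant_lag(series: List[float], max_lag: int) -> int:
    """Lag (period candidate) with the highest autocorrelation."""
    return max(range(1, max_lag + 1), key=lambda lag: autocorrelation(series, lag))
```

For a signal that repeats every four steps, `dominant_lag` selects a lag of 4. Lags with high autocorrelation are the period candidates that an autocorrelation-based model can use to relate distant but similar sub-series.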

2- DyGAT-LSTM 

We combined our DyGAT model with an LSTM-based encoder-decoder model, a design that is

typically used for sequence-to-sequence tasks such as time-series forecasting. The Seq2Seq

LSTM model was presented as part of a research paper on spatiotemporal wind speed

forecasting by Bentsen et al. [77]. Since LSTMs are popular and powerful time-series

forecasting models, we found it interesting to combine an LSTM with DyGAT in a hybrid

model and compare its performance to that of the DyGAT-Informer.

In this LSTM model, the encoder processes the input sequence using a multilayer

LSTM network, which transforms the input data into a series of internal

representations, known as “hidden states”, that capture the sequential dependencies and

patterns in the input. These internal representations are then passed to the decoder. The

decoder also uses a multilayer LSTM network, which generates the output sequence step by

step by leveraging the information from the hidden states provided by the encoder. The

model uses a strategy called the “recursive strategy”, in which each prediction is used

as the input for the next time-step, so that the model builds its forecasts iteratively

from its previous outputs.
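The recursive strategy can be illustrated independently of the LSTM itself. The sketch below assumes a generic one-step predictor that maps a window of past values to the next value; `recursive_forecast` and the toy predictor `mean_of_last_two` are hypothetical stand-ins and do not come from the cited model.

```python
from typing import Callable, List, Sequence


def recursive_forecast(one_step: Callable[[Sequence[float]], float],
                       history: Sequence[float],
                       horizon: int) -> List[float]:
    """Forecast `horizon` steps by feeding each prediction back as input."""
    window = list(history)
    forecasts: List[float] = []
    for _ in range(horizon):
        next_value = one_step(window)
        forecasts.append(next_value)
        # Slide the window: drop the oldest value, append the new prediction.
        window = window[1:] + [next_value]
    return forecasts


def mean_of_last_two(window: Sequence[float]) -> float:
    """Toy one-step predictor standing in for the decoder (illustration only)."""
    return (window[-1] + window[-2]) / 2.0


print(recursive_forecast(mean_of_last_two, [1.0, 3.0], horizon=3))
# prints [2.0, 2.5, 2.25]
```

Because later steps depend on earlier predictions rather than on observations, errors made early in the horizon can propagate to the later steps.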

3- ST-GAT 

The ST-GAT model architecture combines a GAT with temporal convolution to handle

spatiotemporal data. The model has a “TimeBlock” that captures temporal

dependencies for each node separately. The temporal convolution is performed by a Gated

Temporal Convolutional Layer (GTCN), which helps the model learn temporal patterns

by applying a series of convolutional operations over time. The TimeBlock has several

GTCN layers. After the temporal processing, the model uses GAT layers to capture spatial

dependencies. The GAT layers use a node attention mechanism to dynamically weigh the

importance of neighboring nodes. After the temporal and spatial processing of the data, a



final output layer transforms the final feature representations into the desired forecast 

window. The overall ST-GAT architecture is presented in Figure (A.2 - a) in Appendix

A, and Figure (A.2 - b) illustrates the structure of the GTCN.
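The node attention step, in which a node weighs its neighbors before aggregating their features, can be sketched as follows. This is a minimal single-head illustration of the general GAT idea (LeakyReLU on raw scores, then a softmax over the neighbors). The scoring function is passed in as a hypothetical placeholder for the learned attention function, and none of the names are taken from the ST-GAT code.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def attention_weights(scores: List[float]) -> List[float]:
    """Softmax over the LeakyReLU-activated scores of one node's neighbors."""
    activated = [leaky_relu(s) for s in scores]
    peak = max(activated)  # subtract the maximum for numerical stability
    exps = [math.exp(a - peak) for a in activated]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(node: int,
              features: Dict[int, Vector],
              neighbors: Dict[int, List[int]],
              score: Callable[[Vector, Vector], float]) -> Vector:
    """Attention-weighted sum of the neighbors' feature vectors."""
    ids = neighbors[node]
    weights = attention_weights([score(features[node], features[j]) for j in ids])
    dim = len(features[node])
    return [sum(w * features[j][k] for w, j in zip(weights, ids))
            for k in range(dim)]
```

Because the weights are recomputed from the current node features, a neighbor's influence can change from one input to the next, which is what makes the weighting dynamic.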

4- ST-GCN  

The spatiotemporal graph convolutional network (ST-GCN) has a similar design to ST-GAT:

the model has spatiotemporal blocks and uses temporal convolution to

process the temporal dependencies. However, this