EXPLORING THE CAPACITY AND PERFORMANCE OF SUPERVISED LEARNING METHODS FOR LABEL CLASSIFICATION IN CAUSAL INFERENCE: A COMPARATIVE STUDY

Abu Saqer, Ola

Date

2024-08-08

Authors

Abu Saqer, Ola

Publisher

An-Najah National University

Abstract

Machine learning is increasingly discussed because of its predictive accuracy and its ability to handle vast amounts of data. At the same time, many relationships in life are causal, which motivates efforts to understand the cause-and-effect relationships among variables. For instance, it is crucial to understand how strongly a particular medicine affects an individual with an illness. While this might seem straightforward at first glance, a deeper examination reveals the complexity inherent in using machine learning for causality. Machine learning methods have made a valuable contribution to the field of causal inference: unlike traditional approaches, they offer greater flexibility in estimating causal effects because they do not require modelling hypotheses. Yet estimating causal effects when both treatment and outcome are binary variables remains an open research area, because machine learning has proven its ability to predict, and prediction does not mean causality. This is perhaps the central challenge for machine learning in obtaining more accurate and less biased estimates of causal effects. This study conducts a comparative analysis of supervised learning methods for label classification in causal inference. We evaluate the performance and capacity of four techniques, Causal Forest (CF), Support Vector Machine (SVM), Generalized Linear Models (GLM), and Linear Probability Models (LPM), in estimating causal effects for a categorical response variable. A randomized controlled trial simulation and real-data experiments were performed to evaluate the methods' performance under varying conditions, by changing the main characteristics of the data, including the sample size and the number of explanatory variables.
We focused on these four methods because of their specific advantages: Causal Forests are particularly well suited to causal inference; Support Vector Machines are recognized for their effectiveness in binary classification tasks; Generalized Linear Models are well established for modelling a binary response variable; and Linear Probability Models are used for their ability to provide predictions as probabilities. The simulation study provides valuable insights into the strengths and limitations of each method in each causal-effect scenario. Furthermore, the methods are able to detect heterogeneity in the real-data results, and it was expected that SVM, GLM, and LPM would detect more heterogeneity than Causal Forest. This thesis improves our knowledge of machine learning techniques in causal inference and emphasizes the importance of carefully evaluating their performance in real-world applications.
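
As an illustration of the kind of comparison described above, the following minimal sketch simulates a randomized trial with a binary treatment and a binary outcome, then estimates the average treatment effect with an LPM (ordinary least squares) and with a logistic GLM (averaged predicted risk difference). It is not the thesis's actual simulation design: the data-generating model, its coefficients, and the function names `simulate_rct`, `lpm_ate`, and `logit_ate` are all hypothetical, and Causal Forest and SVM are omitted because they would require external libraries.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate_rct(n, p, seed=0):
    """Simulate a randomized trial with binary treatment T and binary outcome Y.

    Assumed (hypothetical) outcome model: logit P(Y=1) = -0.5 + 1.0*T + 0.5*X1.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))        # p explanatory variables
    T = rng.integers(0, 2, size=n)     # randomized assignment, P(T=1) = 0.5
    prob = sigmoid(-0.5 + 1.0 * T + 0.5 * X[:, 0])
    Y = rng.binomial(1, prob)
    return X, T, Y


def lpm_ate(X, T, Y):
    """LPM: OLS of Y on [1, T, X]; the coefficient on T estimates the ATE."""
    D = np.column_stack([np.ones(len(T)), T, X])
    beta, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return beta[1]


def logit_ate(X, T, Y, iters=25):
    """Logistic GLM fitted by Newton-Raphson, then the mean difference between
    predicted risks with everyone treated and everyone untreated."""
    D = np.column_stack([np.ones(len(T)), T, X])
    beta = np.zeros(D.shape[1])
    for _ in range(iters):
        mu = sigmoid(D @ beta)
        grad = D.T @ (Y - mu)                          # score vector
        hess = (D * (mu * (1 - mu))[:, None]).T @ D    # observed information
        beta = beta + np.linalg.solve(hess, grad)
    D1, D0 = D.copy(), D.copy()
    D1[:, 1] = 1.0   # counterfactual: all treated
    D0[:, 1] = 0.0   # counterfactual: all untreated
    return np.mean(sigmoid(D1 @ beta) - sigmoid(D0 @ beta))


if __name__ == "__main__":
    # Vary sample size and number of covariates, as in the study's design.
    for n, p in [(500, 3), (5000, 3), (5000, 10)]:
        X, T, Y = simulate_rct(n, p, seed=1)
        print(n, p, round(lpm_ate(X, T, Y), 3), round(logit_ate(X, T, Y), 3))
```

Under randomization both estimators target the same risk difference, so in a sketch like this they should agree closely; differences between methods would be expected to matter more for heterogeneous effects, which this example does not model.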