Joint explainable and fair AI in healthcare

To evaluate our proposal, we considered three well-known, publicly available healthcare datasets: a diabetes patient dataset, a pneumonia chest X-ray image dataset, and the NIH Chest X-ray image dataset. We have made the evaluation code available (Footnote 1).

We conducted a comprehensive evaluation using widely adopted fairness metrics and explainability techniques to assess the efficacy of ECGL in enhancing the overall fairness and explainability of machine learning models. Specifically, we evaluated ECGL's ability to (i) mitigate biases that may lead to discriminatory outcomes and (ii) enhance transparency into the model's decision-making process.

Evaluation

We utilized two prominent techniques to interpret and explain model predictions: SHAP [47] and GradCAM [48]. We employed SHAP (SHapley Additive exPlanations) values to interpret the importance of features in the models' predictions (first experiment, with the diabetes dataset). SHAP values measure how much each feature contributes to the difference between a prediction and the model's average prediction. They can therefore reveal the most influential features in a model's prediction, feature interactions, and both global and local explanations.
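For reference, the snippet below is a minimal sketch of how SHAP values can be computed for a tabular classifier with the shap library; the names model, X_train, X_test, and feature_names are placeholders rather than the code released with the paper.

```python
import shap

# Model-agnostic explainer over the model's prediction function;
# a subsample of the training data serves as the background distribution.
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(model.predict, background)

# SHAP values for the test set (one array per output class for a softmax model).
shap_values = explainer.shap_values(X_test)

# Global view: beeswarm / bar summary of per-feature contributions.
shap.summary_plot(shap_values, X_test, feature_names=feature_names)
```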

For the second experiment with the pneumonia X-ray image dataset, we used GradCAM (Gradient-weighted Class Activation Mapping) to visualize the regions of the images that the model focused on when making predictions. GradCAM generates heatmaps that highlight the areas of the image that contribute most to the model’s classification. These heatmaps offer insights into how the models are making decisions, potentially revealing biases or limitations. By analyzing the regions highlighted by the heatmaps, we can identify which features (e.g., lung nodules, anatomical structures) are most important for the models’ predictions.
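For illustration, a minimal Grad-CAM sketch in the style of the standard Keras implementation is shown below; the layer name, class index, and single-output assumption are illustrative and do not reproduce the authors' exact code.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index):
    """Compute a Grad-CAM heatmap for one image (H, W, C) and one class."""
    # Model mapping the input to the last conv feature map and the predictions.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]
    # Gradient of the class score w.r.t. the conv feature map,
    # averaged spatially to obtain one weight per channel.
    grads = tape.gradient(class_score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))
    # Weighted sum of the feature maps, ReLU, and normalization to [0, 1].
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)
    cam = tf.nn.relu(cam)
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```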

The fairness metrics we used are presented in [33, 34, 49] and summarized below. The insights derived from fairness metrics can effectively guide the model development process toward mitigating biases. In our experiments on medical datasets, we considered the Equalized Odds Difference (EOD) and Equalized Odds Ratio (EOR) as fairness metrics. These metrics evaluate, respectively, the difference and the ratio of the true positive and false positive rates between protected groups. A difference of 0 and a ratio close to 1 suggest that the model is equally likely to classify individuals from different groups correctly or incorrectly. EOD and EOR focus on the fairness of the model's predictions rather than the overall outcomes, requiring that the probability of a positive prediction, conditioned on the true outcome, be equal across demographic groups. Moreover, equalized odds does not enforce the same positive prediction rates across different groups, allowing it to better reflect differences in base rates of outcomes. This is particularly important in domains where certain conditions may be more prevalent in specific demographic groups.

Additionally, we used common performance metrics, including the Area Under the Curve (AUC) and Accuracy. Overall, higher AUC and Accuracy values indicate a more accurate model, while a lower EOD and a higher EOR indicate a fairer model.
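As a concrete reference, the following sketch shows how these metrics can be computed with Fairlearn and scikit-learn; y_true, y_pred, y_score, and gender are placeholder arrays, not the paper's evaluation script.

```python
from fairlearn.metrics import equalized_odds_difference, equalized_odds_ratio
from sklearn.metrics import accuracy_score, roc_auc_score

# Fairness metrics are computed with respect to the sensitive attribute (e.g., gender).
eod = equalized_odds_difference(y_true, y_pred, sensitive_features=gender)
eor = equalized_odds_ratio(y_true, y_pred, sensitive_features=gender)

# Standard performance metrics.
acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)

print(f"EOD={eod:.3f} (0 is fairest), EOR={eor:.3f} (1 is fairest), "
      f"Accuracy={acc:.3f}, AUC={auc:.3f}")
```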

Experimental settings

The experiments on the diabetes and pneumonia datasets were conducted on a system with 2 Intel Xeon virtual CPUs and 12 GB of RAM, using the Python programming language and the TensorFlow library. The experiment on the NIH Chest X-ray images was conducted using 2 T4 GPUs, 30 GB of RAM, and 16 GB of GPU RAM. The "Fairlearn" and "SHAP" libraries were used for fairness and explainability evaluation, respectively. The GradCAM visualization was implemented by the authors using the "Keras" library. We also used the "Scikit-learn" library for data preprocessing and standardization.

Experiment I: diabetes dataset

The diabetes dataset (Footnote 2) is a widely used benchmark in the healthcare domain. It comprises 768 instances with eight attributes, including the number of pregnancies, glucose level, blood pressure, skin thickness, insulin level, body mass index (BMI), and diabetes pedigree function (DPF). The target variable is a binary class indicating the presence or absence of diabetes. We evaluated the performance of the proposed model across a range of metrics, including EOD, EOR, Accuracy, and AUC.

Model selection

The model architecture of the examined diabetes deep learning network is illustrated in Fig. 1. We utilized a sequential neural network consisting of three dense layers with decreasing numbers of units (128 \(\rightarrow\) 64 \(\rightarrow\) 32), interspersed with dropout layers to prevent overfitting. The final layer has two output units with a softmax activation function, producing probabilities for the two classes, i.e., the positive and negative diabetes cases.
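A minimal Keras sketch of such an architecture is given below; the dropout rate is an assumption, as it is not stated in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Eight tabular input features; two output classes (diabetes positive / negative).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),          # dropout rate is an assumption, not stated in the text
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(32, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
```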

Fig. 1 The architecture of the neural network used for the experiment on the diabetes dataset

For training, we used the sparse categorical cross-entropy loss function and the Adam optimizer. The model was trained for up to 100 epochs, with early stopping applied to avoid overfitting.
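The corresponding training setup could look as follows; the early-stopping patience and the data variables (X_train, y_train, X_val, y_val) are placeholders and assumptions.

```python
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True  # patience is an assumption
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[early_stop],
)
```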

Explanation constraints

To effectively define explanation constraints in our model, we conducted a comprehensive analysis of the dataset and leveraged expert domain knowledge. This analysis allowed us to identify the relevant ranges of the data features and their correlations with the likelihood of a positive diabetes diagnosis. We estimated the probabilities of positive and negative diabetes cases corresponding to the feature ranges reported in Table 1, defined the corresponding constraints, and used them to guide the model's predictions towards classification labels that align with these probabilities. Notably, we used the normalized values of these conditions to better align them with the loss values during training.

Table 1 Specific conditions used in experiment I. LB and HB refer to lower and higher bounds of feature values, respectively. P is the empirical probability of diabetes for the specified range. Convexity indicates whether the constraint region defined by LB and HB is convex

In particular, each condition in Table 1 specifies a corresponding constraint violation function \(g(\theta)\), which evaluates the inconsistency between the model predictions and the expert probabilities over a given feature range. For a sample whose features lie in a specified range (say, Glucose > 170), the violation is assessed against the probabilities output by the model. For instance, if domain knowledge indicates a high probability (e.g., 0.84) of a positive diagnosis for the high-glucose group, then this constraint term penalizes the model when it predicts low probabilities for the positive class. The violation itself is expressed as a smooth, differentiable function of the model outputs, measuring the model's failure to conform with the expert knowledge. Such a formulation lends itself naturally to gradient-based optimization in the Augmented Lagrangian framework.
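To make this concrete, the snippet below sketches one possible differentiable violation term for the high-glucose condition; the feature index, threshold, and expert probability are illustrative, and the exact functional form used by ECGL may differ.

```python
import tensorflow as tf

def glucose_constraint_violation(x_batch, y_prob, glucose_idx=1,
                                 threshold=170.0, expert_p_pos=0.84):
    """Penalize low predicted positive probability for samples with Glucose > threshold.

    x_batch : (batch, n_features) raw (un-normalized) inputs
    y_prob  : (batch, 2) softmax outputs; column 1 is the positive class
    """
    # Mask of samples falling in the expert-defined range.
    in_range = tf.cast(x_batch[:, glucose_idx] > threshold, tf.float32)
    # Smooth, differentiable violation: squared shortfall of the predicted
    # positive probability relative to the expert probability.
    shortfall = tf.square(tf.nn.relu(expert_p_pos - y_prob[:, 1]))
    n_in_range = tf.reduce_sum(in_range) + 1e-8
    return tf.reduce_sum(in_range * shortfall) / n_in_range
```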

The constraints in Experiment I are all defined over closed intervals or threshold-based conditions on individual features; hence, they define convex feasible regions in the input space. Table 1 reports the convexity classification of each condition. This makes the problem compatible with convex constrained optimization techniques; however, our framework remains applicable in non-convex scenarios thanks to its iterative penalty-based scheme.
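A rough sketch of how such a violation term could be folded into an augmented-Lagrangian-style training loop is given below, reusing the illustrative violation function sketched above; loss_fn, optimizer, train_ds, num_epochs, and the multiplier/penalty schedule are all assumptions, not the paper's exact procedure.

```python
# Hypothetical augmented-Lagrangian-style loop for one constraint whose violation
# g(theta) should be driven to zero; lam and rho are the multiplier and penalty
# coefficient, updated between epochs (here from the last batch's violation).
lam, rho = 0.0, 1.0

for epoch in range(num_epochs):
    for x_batch, y_batch in train_ds:
        with tf.GradientTape() as tape:
            y_prob = model(x_batch, training=True)
            task_loss = loss_fn(y_batch, y_prob)
            g = glucose_constraint_violation(x_batch, y_prob)
            # Augmented Lagrangian: linear multiplier term plus quadratic penalty.
            total_loss = task_loss + lam * g + 0.5 * rho * tf.square(g)
        grads = tape.gradient(total_loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # Dual update: grow the multiplier in proportion to the remaining violation,
    # and optionally increase the penalty coefficient.
    lam = lam + rho * float(g)
    rho = min(rho * 2.0, 100.0)
```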

Results and discussion

The Base model was trained to optimize the output accuracy without any explanation constraints. Table 2 compares the base network with ECGL variants that impose explanation constraints on individual features or on all features simultaneously. The All-constraints model achieves the highest accuracy (0.7857, +3.89% over Base Model) but does not improve fairness (EOD = 0.3399, identical to the Base model; EOR = 0.3174 vs. 0.3265). In contrast, several single-feature variants provide meaningful fairness gains: the Pregnancies-only model reduces EOD by 43% (0.1931) and raises AUC by 4.84% (0.7646). Overall, imposing carefully chosen explanation constraints can lower bias while preserving—or slightly improving—predictive performance; combining all constraints maximizes accuracy at the cost of fairness.

Table 2 Performance of ECGL. The Base model is the model without any explanation constraints. The X model incorporates explanation constraints for X, where \(X \in \{\)Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, DPF\(\}\); the All Constraints model incorporates all of them. Bold indicates the best value in each column

The explainability analysis based on SHAP values is presented in Table 3. As shown there, the feature importance values generally increase when the associated explanation constraints are applied during model training. For instance, integrating the constraint associated with Pregnancies led to an approximately 9.6% rise in the importance of this feature in the model's final prediction. Similarly, several features, such as Skin Thickness and Insulin, exhibited substantial increases in their SHAP importance values, whereas a few features, such as BMI and DPF, showed a slight decrease. Moreover, by combining all constraints, the All Constraints model achieves an overall improvement in feature importance alignment compared to the Base model, supporting the effectiveness of ECGL in steering the model towards clinically relevant explanations.

Table 3 Comparing the SHAP feature importance values obtained by the constraint-based ECGL models with those of the Base model, and the corresponding improvement percentages

SHAP values can also help identify feature interactions, which may be crucial for understanding the model's behavior: a constraint defined on one feature may also affect how the model attends to another. For example, the SHAP importance of BMI increases substantially (+93.5% to +234.9%) under almost all constraints, suggesting that BMI is highly sensitive to constraints and plays a critical role in the predictions. The importance of Insulin under the Diabetes Pedigree Function (DPF) constraint shows the highest value in the table. BMI and Insulin can therefore be regarded as the most "constraint-sensitive" features, since their gains are larger than those of the other features.

Figure 2 presents SHAP summary plots for the Base model and the proposed ECGL (All Constraints model), providing a visual representation of each feature's contribution to the predictions. In these plots, the horizontal position of each dot represents the feature's SHAP value, i.e., whether and how strongly it pushes the prediction towards a positive or negative outcome, while the color encodes the feature value. In Fig. 2(a) and 2(b), the Base model highlights Glucose and BMI as the most influential features, followed by DPF, while features like Pregnancies, Blood Pressure, Insulin, and Skin Thickness show relatively lower importance. The All Constraints model, however, demonstrates a shift in feature importance. While Glucose and BMI remain the most significant features, the importance of Insulin has increased markedly due to the integration of explanation constraints: the Base model assigns Insulin importance values close to 0 or negative, whereas the All Constraints model shows a wider range with more positive and extreme values. Similarly, Pregnancies, Blood Pressure, and Skin Thickness exhibit increased importance, indicating that the constraints effectively guide the model to focus on these relevant features.

Fig. 2 SHAP evaluation of the Base model and ECGL (All Constraints model): (a, b) SHAP summary plots of the Base model and ECGL, respectively; (c, d) mean absolute SHAP values of the Base model and ECGL, respectively

Figures 2(c) and 2(d) illustrate the mean absolute SHAP values for the different features, highlighting their relative importance in predicting the target variable for the Base model and the All Constraints model. These plots show that applying the constraints slightly changed the overall feature importance. While Glucose remains the most influential feature, the importance of other features increased, indicating that the constraints guided the model to focus more on them. For example, the importance value of Glucose increased from 0.175 in the Base model to 0.20 in the All Constraints model. Additionally, the constraints had a significant effect on features such as Insulin, Pregnancies, and Skin Thickness, which were not utilized properly by the Base model. This suggests that the constraints enhanced the model's learning process, allowing it to better capture the relevance of these features and resulting in improved predictions and interpretations.

Experiment II: pneumonia X-ray images

In this section, we explore the fairness and explainability performance of the proposed ECGL on the pneumonia classification task, using an X-ray image dataset (Footnote 3). This dataset contains 30,000 frontal-view chest radiographs drawn from the publicly available 112,000-image National Institutes of Health (NIH) chest X-ray collection. We employed GradCAM to gain insights into the model's decision-making process, enabling us to understand the rationale behind its predictions and to identify the improvements gained by applying explanation-guided learning.

Model selection

The overall architecture of the model for Experiment II is presented in Fig. 3. Given the image data, we utilized 2D convolutional layers with the ReLU activation function, accompanied by two max pooling layers with a kernel size of 2. The model employs three output heads of different sizes to predict different targets. The primary head predicts the main classification label, i.e., the pneumonia class. The two additional outputs are used to integrate the gender-based and bounding box-based explanations into the model and to guide it towards predictions more aligned with these constraints. In particular, the bounding box-based explanation head predicts information that can be used to explain the model's focus on specific regions of the input X-ray image, which increases the model's transparency and the explainability of its final predictions.
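A rough Keras sketch of a multi-head model of this kind is shown below; the layer sizes, head dimensions, and output names (label, gender, bbox) are illustrative rather than the authors' exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(128, 128, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)

# Primary task: 3-class pneumonia label. Auxiliary heads carry the
# gender-based and bounding box-based explanation signals.
label_out = layers.Dense(3, activation="softmax", name="label")(x)
gender_out = layers.Dense(1, activation="sigmoid", name="gender")(x)
bbox_out = layers.Dense(4, activation="linear", name="bbox")(x)  # (x, y, w, h)

model = tf.keras.Model(inputs, [label_out, gender_out, bbox_out])
```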

Fig. 3 Architecture of the model used to evaluate the performance of the proposed ECGL and the Base model on pneumonia X-ray image classification

Each sample was converted to a 3-channel RGB image of size 128 by stacking the original image three times along the last axis, with anti_aliasing enabled during resizing to keep the image clear. The data were then split 80%/20% into training and validation sets. The Base model was trained without any explanation constraints, while the ECGL model additionally incorporated the defined explanation constraints. Both models were trained for 30 epochs on 2,000 X-ray image samples of the pneumonia dataset, using the Adam optimizer, with sparse categorical cross-entropy as the loss for the label output and mean absolute error for the bounding-box and gender outputs.
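The preprocessing and per-head loss configuration could be sketched as follows, assuming the illustrative multi-output model above; raw_images and labels are placeholders, and the exact loss-to-head mapping is an assumption.

```python
import numpy as np
from skimage.transform import resize
from sklearn.model_selection import train_test_split

def to_rgb_128(img):
    """Resize a grayscale X-ray and stack it into a 3-channel image."""
    img = resize(img, (128, 128), anti_aliasing=True)
    return np.stack([img, img, img], axis=-1)

X = np.array([to_rgb_128(img) for img in raw_images])  # raw_images is a placeholder
X_train, X_val, y_train, y_val = train_test_split(X, labels, test_size=0.2, random_state=42)

# Per-head losses (the exact loss-to-head mapping here is an assumption).
model.compile(
    optimizer="adam",
    loss={"label": "sparse_categorical_crossentropy",
          "gender": "mean_absolute_error",
          "bbox": "mean_absolute_error"},
)
```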

Explanation constraints

To integrate fairness and explainability, we incorporated two primary constraints: a gender-based constraint and a bounding box-based constraint. By imposing a gender-based explanation constraint, we aim to mitigate potential gender bias in the model's predictions, ensuring that its outputs are not systematically biased towards or against a specific gender. The bounding box-based explanation constraint, in turn, aims to enhance the explainability of the model's predictions by guiding it to focus on relevant regions of the input data, providing insights into the factors that influence its decisions.

Results and discussion

Table 4 presents a comparative analysis of the Base and ECGL models in terms of fairness, accuracy, and AUC. Both models demonstrate a comparable overall accuracy of 87%, suggesting that the constraints imposed on the ECGL model did not significantly compromise its predictive power. The Base model achieves a slightly higher AUC, by a relatively small margin of about 4%, while the ECGL model performs noticeably better at mitigating biases and disparities.

Table 4 Comparing the accuracy and fairness of the base model and proposed ECGL

The EOD and EOR values show slightly different trends. While the Base model achieves a 3% lower EOD compared to the ECGL model, the ECGL model exhibits a noticeable 13% improvement in EOR. As these metrics assess whether the model's predictions are equally distributed across different groups, the results suggest that the ECGL-enhanced model achieves greater uniformity and fairness in its predictive performance when guided by explanation constraints.

Figures 4 and 5 depict the GradCAM heatmaps and superimposed X-ray images of correctly and incorrectly predicted pneumonia instances, respectively. The highlighted (colored) areas represent the regions that the model deemed most important for its decision, and overlaying the heatmaps on the original X-ray images allows us to assess whether the models focus on relevant features. The superimposed images combine the original images with the heatmaps, providing a more comprehensible visual representation of the focused regions. The illustrated heatmaps depict the three classification outputs corresponding to labels 0, 1, and 2, arranged from top to bottom. Label 0 indicates "No Lung Opacity", i.e., patients without diagnosed pneumonia, whereas labels 1 and 2 relate to lung opacity, the presence of fuzzy white clouds in the lungs associated with pneumonia.

Fig. 4 GradCAM heatmaps and superimposed images of the (a) Base model and (b) ECGL model for correctly labeled instances. The heatmaps correspond to GradCAM visualizations for labels 0 (normal), 1 (pneumonia present), and 2 (pneumonia absent, abnormalities exist), arranged from top to bottom

Fig. 5 GradCAM heatmaps and superimposed images of the (a) Base model and (b) ECGL model for incorrectly labeled instances. The heatmaps correspond to GradCAM visualizations for labels 0 (normal), 1 (pneumonia present), and 2 (pneumonia absent, abnormalities exist), arranged from top to bottom

Likewise, Figs. 6 and 7 compare the proposed ECGL model and the Base model in terms of explainability using GradCAM heatmaps. While both models generally focus on similar (central) areas of the chest X-rays, the color intensity in the heatmaps and superimposed images obtained from the ECGL model clearly shows a greater emphasis on regions related to the respiratory system and pneumonia. As can be seen in Fig. 4(a) and 4(b) (correctly labeled instances), the highlighted regions of the Base model are more blurred, hazy, and spread out across the image, suggesting that its decisions are less interpretable for experts, whereas the proposed ECGL model presents more focused and localized highlighted regions, showing that it relies primarily on specific areas of the image related to the region affected by the disease. The ECGL model thus makes its reasoning easier to understand and offers greater explainability. In addition, it produces a more concentrated heatmap, indicating greater confidence in its predictions.

Fig. 6 GradCAM heatmaps (left column) and superimposed images (right column) of the (a) Base model and (b) ECGL model for correctly labeled instances. The heatmaps correspond to GradCAM visualizations for labels 0 (normal), 1 (pneumonia present), and 2 (pneumonia absent, abnormalities exist), arranged from top to bottom

Fig. 7 GradCAM heatmaps (left column) and superimposed images (right column) of the (a) Base model and (b) ECGL model for incorrectly labeled instances. The heatmaps correspond to GradCAM visualizations for labels 0 (normal), 1 (pneumonia present), and 2 (pneumonia absent, abnormalities exist), arranged from top to bottom

Moreover, despite the models' similar performance illustrated in Fig. 5(a) and 5(b), the GradCAM visualizations demonstrate noticeable differences. The Base model produced a more scattered visualization that may not correspond to the actual pathology, indicating that its predictions can rely on irrelevant features rather than on the key regions. The proposed ECGL model, on the other hand, highlighted regions that are more relevant to the medical condition under consideration, suggesting that the constraints helped the model focus on more meaningful features.

Unlike the Base model, the ECGL model's heatmap for the incorrect prediction is more localized, highlighting a specific region in the lung. The overlay shows that there is indeed an abnormality in this area, but it may have been insufficient to confidently classify the image, resulting in an incorrect final prediction. While such false predictions highlight the complexity of the task, the findings suggest that adding constraints and guiding the model with domain-specific explanations through the proposed ECGL can enhance the model's explainability even when overall accuracy remains essentially unchanged. With such explainability, experts can better identify the model's deficiencies and limitations, leading to their more effective mitigation.

A bounding box (BBOX) is a rectangular region commonly used to localize (highlight) an area of interest. Figure 8 illustrates the BBOXes of some pneumonia instances predicted by the proposed ECGL model. The red box indicates the ground-truth BBOX annotated for the X-ray image, and the blue box is the BBOX predicted by the ECGL model. Note that the Base model cannot generate a BBOX because it lacks the explanation constraints required for BBOX prediction.

Fig. 8 Ground truth (red) and predicted (blue) BBOX regions in X-ray images: (a) and (b) represent samples where the model predicts an incorrect label, while (c) corresponds to a correct prediction

Looking at Fig. 8(a) and 8(b), it is evident that the proposed ECGL predicted BBOXes that align with the true boxes, even though it estimated a somewhat larger area. This indicates that integrating BBOX constraints as explanation constraints effectively directs the model's attention towards more relevant regions. Figure 8(c) presents a case where the predicted BBOX lies only partially within the lung area, which may highlight the need for more refined BBOX constraints to better capture the spatial extent of pneumonia.

ECGL seems promising; however, it also has several limitations. First, we have only evaluated it on three datasets (one tabular diabetes dataset and two chest X-ray image datasets), so we do not yet know how it will behave with data from other hospitals or patient groups. Second, the explanation constraints that steer the model are not produced automatically; they must be written by clinicians based on their domain knowledge and SHAP plots, which can be slow and error-prone. Automated methods for discovering constraints could be a theme of future research towards improving scalability, for example by inferring constraints from multi-center data using techniques such as meta-learning, or by using unsupervised approaches to detect consistent patterns in model explanations that can, in turn, be formalized as constraints. Third, we examined fairness for only one binary attribute (male vs. female) and one fairness criterion (equalized odds), whereas real-world bias is considerably more complex. Fourth, the quality of ECGL's explanations is tied to Grad-CAM and SHAP; if those tools point in the wrong direction, the constraints can too. Finally, training with additional fairness and explanation losses requires more computational resources and some extra hyperparameter tuning. These gaps provide a clear agenda for future work.

Experiment III: NIH chest X-ray images

We assess the performance of our proposed ECGL model on the NIH Chest X-ray dataset (Footnote 4), a large-scale anonymized collection of 112,120 frontal-view chest X-ray images labeled across 14 thoracic disease categories via NLP techniques applied to radiology reports. From this dataset, we created a balanced subset for multi-label classification with at least 700 positive samples per class, yielding 8,307 training images, 4,973 validation images, and 7,156 test images. Each image is resized to 128 × 128 pixels and normalized to [0, 1]. Horizontal flips and brightness and contrast perturbations are used for image augmentation during training. Gender is included as an auxiliary label to enable fairness analysis. To conduct a fairness assessment across demographic groups, in particular gender, we report multiple group fairness metrics: Demographic Parity Difference (DPD), Demographic Parity Ratio (DPR), Equalized Odds Difference (EOD), Equalized Odds Ratio (EOR), and Statistical Parity Difference (SPD). These metrics assess disparities in the model's decision outputs and are computed across all disease labels in the multi-label setting.
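The augmentation pipeline described above can be sketched with tf.image as follows; the perturbation ranges and the variable names (train_images, train_labels) are assumptions.

```python
import tensorflow as tf

def augment(image, label):
    """Random horizontal flip plus brightness/contrast jitter (ranges are illustrative)."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    return tf.clip_by_value(image, 0.0, 1.0), label

train_ds = (
    tf.data.Dataset.from_tensor_slices((train_images, train_labels))
    .shuffle(1024)
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)
)
```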

Model selection

The architecture of the model is shown in Fig. 9. It consists of a sequential arrangement of four convolutional blocks, each containing a 2D convolutional layer with ReLU activation, batch normalization, and max pooling. The number of filters increases across the blocks in powers of 2: 32, 64, 128, and 256, allowing the model to extract increasingly abstract spatial features from the NIH chest X-ray images. The output of the last convolutional block is passed through a global average pooling layer followed by dropout (with a rate of 0.5) to counter overfitting, then fed into a dense layer with 512 neurons and L2 regularization, and finally into an output layer of sigmoid-activated neurons for multi-label disease classification. Fairness is addressed by adding gender-based explanation constraints, which encourage the model to produce similar feature attributions for both genders, reducing demographic bias in the learned representations and predictions. The model is trained with a batch size of 128 for 20 epochs.
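A minimal Keras sketch of this architecture is given below; the input channel count, the dense-layer activation, the L2 strength, and the binary cross-entropy loss are assumptions not specified in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def conv_block(x, filters):
    """One block: conv + ReLU, batch normalization, and 2x2 max pooling."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    return layers.MaxPooling2D(2)(x)

inputs = layers.Input(shape=(128, 128, 3))      # channel count is an assumption
x = inputs
for filters in (32, 64, 128, 256):
    x = conv_block(x, filters)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(512, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4))(x)   # L2 strength is an assumption
outputs = layers.Dense(14, activation="sigmoid")(x)             # 14 thoracic disease labels

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="binary_crossentropy",                       # loss choice is an assumption
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
```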

Fig. 9 Architecture of the model used to evaluate the performance of the proposed ECGL and the Base model on NIH X-ray multi-label image classification

Explanation constraints

Similar to Experiment II, we introduced a gender-based explanation constraint to reduce possible gender bias in the model's predictions. This constraint encourages the model's predictions not to be systematically affected by gender, promoting equal treatment across demographic groups. By enforcing gender-based explanation consistency, we guide the model to align its attribution patterns between genders, thereby lowering the likelihood of biased feature attributions in its decision-making process.
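One illustrative way to express such an explanation-consistency penalty is sketched below, using input-gradient saliency as a stand-in attribution; this is an assumption for illustration, not the paper's exact constraint formulation.

```python
import tensorflow as tf

def attribution_consistency_penalty(model, x_batch, gender):
    """Penalize differences between the mean saliency maps of the two gender groups.

    gender is a 0/1 tensor of shape (batch,); the saliency proxy used here
    (absolute input gradients of the summed prediction) is an illustrative choice.
    """
    with tf.GradientTape() as tape:
        tape.watch(x_batch)
        preds = model(x_batch, training=True)
        score = tf.reduce_sum(preds)
    saliency = tf.abs(tape.gradient(score, x_batch))        # (batch, H, W, C)

    gender = tf.reshape(tf.cast(gender, tf.float32), (-1, 1, 1, 1))
    eps = 1e-8
    mean_f = tf.reduce_sum(saliency * gender, axis=0) / (tf.reduce_sum(gender) + eps)
    mean_m = tf.reduce_sum(saliency * (1.0 - gender), axis=0) / (tf.reduce_sum(1.0 - gender) + eps)
    # Mean squared difference between group-wise attribution maps.
    return tf.reduce_mean(tf.square(mean_f - mean_m))
```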

Results and discussion

Experimental results on the NIH Chest X-ray dataset are shown in Table 5, while Table 6 reports a detailed comparison between our proposed ECGL model and the baseline CNN model in terms of predictive performance and fairness across demographic groups. With respect to classification performance, both models attain identical validation and test accuracy (0.8805 and 0.7130, respectively), indicating that including explanation constraints in ECGL does not harm classification accuracy. Nevertheless, the AUC scores reveal that ECGL has a slightly lower micro-AUC (0.6737 vs. 0.7006) but a marginally higher macro-AUC (0.6208 vs. 0.6202). This small increase in macro-AUC may indicate more equitable performance across disease classes, especially for rarer conditions, a desirable property for multi-label medical diagnosis.

Table 5 Comparing the accuracy of the base model and proposed ECGL on NIH X-ray images
Table 6 Percentage improvement of the ECGL model over the Base model across fairness metrics, on NIH X-ray images. Negative values indicate degradation

The major strength of ECGL lies in the fairness metrics. For most disease types, the ECGL model shows a significant improvement in DPD, EOD, and SPD. For example, ECGL reduced the EOD for Cardiomegaly from 0.11 to 0.03 (a 73% improvement) and improved the EOR by nearly 19%. The same applies to Pneumonia and Pleural Thickening, where ECGL performs well in reducing demographic disparities across several fairness metrics.

Some classes show extreme improvements or degradations. For instance, Mass and Nodule exhibit very large drops in SPD, indicating a substantial change in the predicted positive rates between gender groups; however, the sign flip in SPD from positive to negative for these classes suggests that fairness improved from one perspective but may have been overcompensated. For Hernia, while DPD and SPD improved, the EOR becomes undefined due to a zero denominator in the base model, indicating instability of fairness metrics for extremely underrepresented classes. Importantly, for several diseases (Consolidation, Edema, Effusion, Infiltration, Pneumonia, Pneumothorax), ECGL reduces fairness gaps or makes the EOR more balanced, showing that the model is robust in addressing gender disparities in medical decisions. These results demonstrate that explanation-based fairness constraints can steer the model towards fairer behavior, especially for predictions that might otherwise be biased with respect to demographic attributes such as gender. Overall, the ECGL model successfully balances fairness and accuracy, matching the base model's predictive performance while achieving significant improvements in demographic fairness. This supports the view that integrating explanations as constraints can act as an additional regularizer for fairer decision-making in high-stakes medical imaging applications such as multi-label diagnosis.

Computational overhead analysis

Table 7 compares the average training time per epoch of the proposed ECGL method against the base model for three dataset-and-hardware configurations: CPU-based training for the Diabetes and Pneumonia X-ray datasets, and GPU-based training for the NIH Chest X-ray dataset. Since ECGL imposes explanation penalties via the Augmented Lagrangian method, its additional complexity mainly affects the training stage, so the average epoch time is a relevant measure of the overhead.

Table 7 Comparing the average computational cost of a single training epoch

The results show that ECGL incurs additional computational cost compared to the base models across all datasets and hardware configurations. On the rather small Diabetes dataset, using the CPU, ECGL increased the average epoch duration from 0.84 to 3.40 seconds, an approximately fourfold increase in training time. This larger relative overhead is attributable to the extra work required to compute constraint penalties, update the Lagrange multipliers, and integrate explanation guidance within each training iteration, all of which form a larger fraction of the total computation for smaller models and datasets. In contrast, the overhead is smaller for the Pneumonia X-ray dataset (also CPU-based): the average epoch time increases from 78.85 to 84.64 seconds, an increase of about 7%. This more modest relative increase reflects that the base training cost of the deeper CNN model dominates the total training time, absorbing the additional cost of the ECGL constraints. Finally, on the NIH Chest X-ray dataset, trained on a GPU in a larger-scale and more complex setting, the average epoch time changes negligibly from 405.36 seconds (base) to 409.82 seconds (ECGL), an increase of less than 2%. This shows that in large deep learning settings with a substantial baseline training cost, the extra overhead imposed by the explanation constraints and Augmented Lagrangian updates is small and practically negligible.

These findings indicate that although ECGL introduces measurable computational overhead, its relative cost decreases as model and dataset complexity grow. This behavior supports the scalability and efficiency of ECGL for large-scale medical imaging tasks in real-world settings, where the marginal cost is more than compensated by the improvements in fairness and explainability. A small percentage increase in training time is a worthwhile trade-off given the advantages of combining domain knowledge with explanation constraints to produce trustworthy and equitable models.

Limitations

Although the proposed ECGL framework constitutes a promising advance in incorporating explanation constraints for fairness and interpretability across a spectrum of datasets, some limitations remain. First, the experimental evaluation is currently limited to three datasets and focuses primarily on gender as the protected attribute. This may restrict the generalizability of the results to other domains, and other protected attributes such as race or age should be addressed in future work. Second, while ECGL scales to larger models and datasets with reasonable computational cost, in extreme large-scale scenarios or resource-constrained environments the complexity of the Augmented Lagrangian method, in particular the iterative updates of multipliers and penalties, could become a bottleneck. An in-depth investigation of computational efficiency and potential optimization methods remains an open avenue for research.

Third, the explanation constraints in ECGL require a degree of expert knowledge to define pertinent conditions, which may limit the scalability and applicability of the method in settings where such domain-expert input is not available. Future work may investigate automated or data-driven methods to learn constraints and reduce the need for manual specification. Finally, the current study focuses on comparisons against baselines without contrasting ECGL with state-of-the-art fairness and interpretability methods, such as adversarial debiasing or alternative explanation frameworks. Extended comparisons will be critical to rigorously position ECGL within the broader landscape of explainable and fair AI methods. By explicitly acknowledging these limitations, we aim to outline future research directions for making explanation-constrained learning frameworks more robust, applicable, and efficient.
