Evaluating AI fairness in credit scoring with the BRIO tool
Greta Coraglia, Francesco A. Genco, Pellegrino Piantadosi, Enrico Bagli, Pietro Giuffrida, Davide Posillipo, and Giuseppe Primiero
Abstract
We present a method for quantitative, in-depth analyses of fairness issues in AI systems with an application to credit scoring. To this aim we use BRIO, a tool for the evaluation of AI systems with respect to social unfairness and, more generally, ethically undesirable behaviours.
It features a model-agnostic bias detection module, presented in [CDG+23], to which a full-fledged unfairness risk evaluation module is added. As a case study, we focus on the context of credit scoring, analysing the UCI German Credit Dataset [Hof94a].
We apply the BRIO fairness metrics to several socially sensitive attributes featured in the German Credit Dataset, quantifying fairness across various demographic segments, with the aim of identifying potential sources of bias and discrimination in a credit scoring model. We conclude by combining our results with a revenue analysis.
Introduction
In recent years, the integration of Artificial Intelligence (AI) into various domains has brought forth transformative changes, especially in areas involving decision-making processes. One domain where AI holds significant promise, while also drawing close scrutiny, is credit scoring.
Traditionally, credit scoring algorithms have been pivotal in determining individuals’ creditworthiness, thereby influencing access to financial services, housing, and employment opportunities. The adoption of AI in credit scoring offers the potential for enhanced accuracy and efficiency, leveraging vast datasets and complex predictive models [GP21]. Nevertheless, the inherently opaque nature of AI algorithms poses challenges in ensuring fairness, particularly concerning biases that may perpetuate or exacerbate societal inequalities. Fairness in credit scoring has become a paramount concern in the financial industry.
According to the AI Act and to the European Banking Authority guidelines, which state that "the model must ensure the protection of groups against (direct or indirect) discrimination" [Eur20], ensuring fairness and preventing or detecting bias is becoming imperative. Fairness is fundamental to maintaining trust in credit scoring systems and upholding principles of social justice and equality. Biases in credit scoring algorithms can stem from various sources, including historical data, algorithmic design, and decision-making processes, thus necessitating the development of robust fairness metrics and frameworks to mitigate these disparities [Fer23, BCEP22, NOC+21].
Various metrics have been proposed to evaluate the fairness of credit scoring algorithms, including disparate impact analysis, demographic parity, and equal opportunity criteria. Disparate impact analysis examines whether the outcomes of the algorithm disproportionately affect protected groups; demographic parity requires that decision outcomes be independent of demographic characteristics such as race, gender, or age; equal opportunity criteria focus on ensuring that individuals have an equal chance of being classified correctly by the algorithm, irrespective of their demographic attributes. Still, several challenges persist in implementing fair algorithms. One key challenge is the trade-off between fairness and predictive accuracy, as optimizing for one may inadvertently compromise the other. Moreover, biases inherent in training data, algorithmic design, and decision-making processes can perpetuate unfair outcomes, necessitating careful consideration and mitigation strategies.
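To make these three criteria concrete, the following minimal sketch computes them for a binary classifier and a single binarised sensitive attribute. The function names and the random toy data are purely illustrative assumptions of ours and do not correspond to any specific tool.

```python
# Illustrative implementations of the three group-fairness criteria above,
# computed from binary predictions and a binary protected-group indicator.
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Difference in positive-prediction rates between the two groups."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def disparate_impact_ratio(y_pred, group):
    """Ratio of positive-prediction rates (the '80% rule' compares this to 0.8)."""
    rate_a = y_pred[group == 0].mean()
    rate_b = y_pred[group == 1].mean()
    return min(rate_a, rate_b) / max(rate_a, rate_b)

def equal_opportunity_gap(y_true, y_pred, group):
    """Difference in true-positive rates between the two groups."""
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))

# Toy usage with random data in place of real credit-scoring outputs.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
y_pred = rng.integers(0, 2, 1000)
group = rng.integers(0, 2, 1000)   # e.g. a binarised sensitive attribute
print(demographic_parity_gap(y_pred, group),
      disparate_impact_ratio(y_pred, group),
      equal_opportunity_gap(y_true, y_pred, group))
```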
The literature on fairness detection and mitigation in credit scoring has seen significant advancements, with researchers proposing various methods to address biases and promote equitable outcomes [HPS16, FFM+15, ZVRG17, LSL+17, DOBD+20, BG24]. Hardt et al. [HPS16] examine fairness in the FICO score dataset, considering race and creditworthiness as sensitive attributes. They employ statistical parity and equality of odds as fairness metrics to assess disparities in credit scoring outcomes across demographic groups. In [FFM+15], Feldman et al. propose a fairness mitigation method based on dataset repair to reduce disparate impact, applying it to the German credit dataset [Hof94b].
They focus on age as the sensitive attribute and employ techniques to adjust the dataset to mitigate biases in credit scoring outcomes. Zafar et al. [ZVRG17] introduce a regularization method for the loss function of credit scoring models to mitigate unfairness with respect to customer age in a bank deposit dataset.
Their approach aims to prevent discriminatory outcomes by penalizing unfair predictions based on sensitive attributes. In [LSL+17] the authors propose the implementation of a variational fair autoencoder to address unfairness in gender classification within the German dataset. Their approach leverages generative modeling techniques to learn fair representations of data and mitigate gender-based biases in credit scoring. In [DOBD+20], Donini et al. analyze another regularization method, aimed at minimizing differences in equal opportunity, on the German credit dataset. Their empirical analysis highlights the effectiveness of regularization techniques in promoting fairness and equity in credit scoring outcomes. Most recently, the work in [BG24] combines traditional group fairness metrics with Shapley values, although the latter admittedly may lead to false interpretations (cf. [AB22]) and should thus be combined with counterfactual approaches.
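As an illustration of the regularization strategies surveyed above, the sketch below adds a covariance-style fairness penalty, in the spirit of [ZVRG17], to a standard logistic loss. It is a simplified reconstruction under our own assumptions (names and the trade-off weight `lam` are ours), not the original authors' implementation.

```python
# Sketch of a fairness-regularized logistic loss: a penalty on the covariance
# between the sensitive attribute s and the decision-boundary margin discourages
# predictions that track the sensitive attribute.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fair_logistic_loss(w, X, y, s, lam=1.0):
    """Logistic loss plus a covariance penalty between margin and sensitive attribute s."""
    margin = X @ w                       # signed distance to the decision boundary
    p = sigmoid(margin)
    log_loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    fairness_penalty = np.abs(np.mean((s - s.mean()) * margin))
    return log_loss + lam * fairness_penalty

# The weights w could then be fitted with any generic optimizer, e.g.:
#   from scipy.optimize import minimize
#   res = minimize(fair_logistic_loss, np.zeros(X.shape[1]), args=(X, y, s, 1.0))
```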
While the existing tools and studies present different fairness analyses and bias mitigation methods, to the best of our knowledge none of them enables the user to conduct an overall analysis yielding a combined, aggregated measure of the fairness violation risk related to all selected sensitive features. Moreover, unlike many existing approaches, ours is model-agnostic, while still supporting bias mitigation considerations.
We offer such a result using BRIO, a bias detection and risk assessment tool for ML and DL systems, presented in [CDG+23] and based on the formal analyses introduced in [DP21, PD22, GP23, DGP24]. In the present paper, we showcase its use on the UCI German Credit Dataset [Hof94a] and present an encompassing, rigorous analysis of fairness issues in the context of credit scoring, in line with recent ethical guidelines. To operationalize these principles, we measure the fairness metrics over the sensitive attributes present in the German Credit Dataset, quantifying and evaluating fairness across various demographic segments, thereby seeking to identify potential sources of bias and discrimination.
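As a purely illustrative sketch of what such a per-attribute measurement and aggregation can look like (this is not BRIO's actual API or metric), one can compare each subgroup's approval rate against the overall rate for every sensitive feature and combine the per-feature gaps into a single score. Column names below are hypothetical.

```python
# Hypothetical aggregation of per-feature fairness gaps into one risk-style score.
import pandas as pd

def feature_divergence(df, feature, pred_col="prediction"):
    """Largest absolute gap between a subgroup's approval rate and the overall rate."""
    overall = df[pred_col].mean()
    subgroup_rates = df.groupby(feature)[pred_col].mean()
    return (subgroup_rates - overall).abs().max()

def aggregated_risk(df, sensitive_features, pred_col="prediction"):
    """Per-feature divergences plus their mean, as one possible aggregation."""
    scores = {f: feature_divergence(df, f, pred_col) for f in sensitive_features}
    return scores, sum(scores.values()) / len(scores)

# Example call, assuming a dataframe of model outputs with (illustrative)
# sensitive-attribute columns from the German Credit Dataset:
#   scores, risk = aggregated_risk(df, ["age_group", "sex", "foreign_worker"])
```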
The rest of this paper is structured as follows. In Section 2 we provide a preliminary illustration of the dataset under investigation, the features considered, and the associated performance. In Section 3 we explain how we constructed an ML model for credit score prediction trained on this dataset, describe its evaluation and validation, and report the results on the score distribution. In Section 4 we illustrate the theory behind BRIO's bias identification and risk evaluation. In Section 6 we present the results of the risk evaluation on the UCI German Credit Dataset using BRIO. We conclude in Section 8 with further research lines.