How Do We Know That Artificial Intelligence Models Are Fair? An Overview of Bias Evaluation Frameworks for AI Models
Artificial Intelligence and Fairness | Kejun Dai
Artificial intelligence (AI) has become prevalent in many day-to-day decision-making processes. However, studies have found that AI models risk making discriminatory predictions. In response, bias evaluation frameworks have been developed to quantify the bias in particular models and provide methods to mitigate it. This survey examines the current state of bias evaluation frameworks and identifies possible improvements. Each framework is evaluated based on its reliability, generalisability, guidance, and robustness. My main finding is that all frameworks are restricted to addressing fairness only as equality, rather than equity. They also lack the clarity and guidance new users need to differentiate between their metrics.
Artificial Intelligence’s Fairness Problem
Artificial intelligence (AI) is a developing technology that has exploded into widespread popularity in recent years. Its ability to enable machines to make decisions whose accuracy rivals that of human experts has attracted attention from many businesses and government organisations. As a result, AI models are increasingly implemented in decision-making processes, even influencing decisions that are critical to individuals' lives, such as loan approval [1], job recruitment [2], school admission [3], and credit card risk prediction [4].
However, AI models must be trained on large amounts of data to extract sufficient knowledge to achieve expert-level accuracy. They are not immune to real-world bias in that training data, and can make discriminatory decisions [5]. For example, COMPAS, a criminal risk assessment program, was more likely to falsely label black individuals as being at high risk of reoffending than white individuals [6]. In another case, an algorithm designed to promote job advertisements in the STEM fields showed fewer ads to women than to men, which was not the intention of its authors [7].
Fairness in machine learning is an emerging area of research within the wider field of AI that aims to address this issue. Current research involves developing algorithms to mitigate biases in AI models during the data collection, model training, and development phases. It also encompasses methods and tools that quantify bias in AI models using fairness metrics and audit the degree of fairness with which the models operate. However, model fairness is a multifaceted social problem, and researchers have not yet agreed upon a definitive set of fairness metrics. As a result, a diverse range of fairness metrics have been proposed and are used in machine learning fairness research.
Bias evaluation frameworks are tools that allow users to calculate fairness metrics for their AI models, modify those models by implementing bias mitigation techniques, and, in some cases, automate benchmark experiments.
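As a concrete illustration, the minimal sketch below uses AI Fairness 360 [8] to compute one fairness metric and apply one mitigation technique; the toy data, column names, and group definitions are my own illustrative assumptions rather than part of any study cited here.

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

# Illustrative toy data: a binary outcome and a binary protected attribute.
df = pd.DataFrame({
    "sex":     [0, 0, 0, 1, 1, 1, 1, 0],
    "outcome": [1, 0, 0, 1, 1, 0, 1, 0],
})
dataset = BinaryLabelDataset(
    df=df, label_names=["outcome"], protected_attribute_names=["sex"]
)
privileged, unprivileged = [{"sex": 1}], [{"sex": 0}]

# 1. Quantify bias with a fairness metric.
metric = BinaryLabelDatasetMetric(
    dataset, unprivileged_groups=unprivileged, privileged_groups=privileged
)
print("Statistical parity difference:", metric.statistical_parity_difference())

# 2. Mitigate it, here with the Reweighing pre-processing algorithm.
reweighed = Reweighing(
    unprivileged_groups=unprivileged, privileged_groups=privileged
).fit_transform(dataset)
```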
Method
My summer research project surveyed state-of-the-art bias evaluation frameworks to identify potential improvements that could be made in the future. My methodology involved learning each framework and demonstrating its capabilities with benchmark experiments. I then evaluated the frameworks, focusing on reliability, generalisability, guidance, and robustness. In this context, reliability refers to how unlikely a framework is to encounter problems during operation. Generalisability is a framework's compatibility with models and datasets from different AI ecosystems. Guidance reflects how easy or difficult it is for a new user to understand and use the framework. Robustness is a framework's capability to evaluate bias in different AI models.
Table 1: Summary of my assessments of bias evaluation frameworks covered in the survey. ML refers to machine learning models, and LM to language models.
Results
In my research survey, I encountered AI Fairness 360 [8], Fairness Indicators [9], Aequitas [10], FairPy [11], Evaluate [12], and HELM [13]. These bias evaluation frameworks can be classified into two types: standard and subordinate (Table 1).
Standard frameworks are usually developed by researchers to allow the AI industry to adopt their findings easily. As a result, these frameworks often offer a diverse range of fairness metrics and are usually equipped with state-of-the-art bias mitigation methods. Aequitas is the most advanced candidate among the standard frameworks thanks to its robust assessment capability. It provides an extensive catalogue of fairness metrics, including 11 absolute metrics and their group disparities. It also helps users interpret these metrics by translating them into digestible reports using threshold-based tests. In addition, it is less demanding in terms of technology and workflow compatibility, meaning that it is much easier to adopt than its counterparts. One shortfall of Aequitas is that it is built to evaluate predictions from a single model, making it less convenient for evaluating multiple training epochs or comparing multiple models.
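As a rough sketch of what this looks like in practice (a minimal example based on Aequitas' documented Python API; the scored predictions, the 'race' attribute, and the reference group are illustrative assumptions), a single table of predictions is enough to produce the absolute metrics, their group disparities, and a threshold-based fairness report:

```python
import pandas as pd
from aequitas.group import Group
from aequitas.bias import Bias
from aequitas.fairness import Fairness

# Aequitas audits one row per prediction: binary 'score' and 'label_value'
# columns plus the attribute columns to slice on (here an illustrative 'race').
df = pd.DataFrame({
    "score":       [1, 0, 1, 1, 0, 1, 0, 0],
    "label_value": [1, 0, 0, 1, 0, 1, 1, 0],
    "race":        ["white", "white", "black", "black",
                    "white", "black", "white", "black"],
})

# Absolute group metrics (false positive rate, predicted prevalence, ...).
group_metrics, _ = Group().get_crosstabs(df)

# Disparities of each group relative to a chosen reference group.
disparities = Bias().get_disparity_predefined_groups(
    group_metrics, original_df=df, ref_groups_dict={"race": "white"}, alpha=0.05
)

# Threshold-based pass/fail fairness assessment, per group and overall.
fairness = Fairness()
group_fairness = fairness.get_group_value_fairness(disparities)
print(fairness.get_overall_fairness(group_fairness))
```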
Figure 1: Aequitas’ visualisation of the result of a model’s evaluation [10].
The other type of bias evaluation framework is the subordinate framework. These frameworks tend to be developed by a prominent actor in the AI industry, generally as its response to the fairness problem. As a result, they are small libraries belonging to a larger AI ecosystem. Subordinate frameworks usually offer only a few fairness metrics and lack features with which standard frameworks would be equipped. However, they benefit from better integration with their ecosystem than standard frameworks. Fairness Indicators is an example of a subordinate framework. It is built on top of the model analysis library of the TensorFlow ecosystem, and its installation also includes the What-If Tool, which visualises how edits to TensorFlow models affect their performance [9]. Compared to Aequitas, it offers fewer absolute metrics and lacks their group disparity counterparts. Moreover, it only compares one fairness metric across different groups at a time, making it much more tedious to analyse the trade-off between a model's performance and its fairness metrics. However, it provides a much more in-depth evaluation of TensorFlow models by working with other libraries in the TensorFlow ecosystem [9].
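The following minimal sketch shows this style of evaluation, assuming the TensorFlow Model Analysis API on which Fairness Indicators is built [9]; the DataFrame columns, thresholds, and the 'sex' slicing feature are my own illustrative placeholders.

```python
import pandas as pd
import tensorflow_model_analysis as tfma

# Illustrative scored predictions with a 'sex' feature to slice on.
df = pd.DataFrame({
    "label":      [1, 0, 1, 0, 1, 0],
    "prediction": [0.9, 0.2, 0.4, 0.6, 0.8, 0.1],
    "sex":        ["female", "male", "female", "male", "female", "male"],
})

eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key="label", prediction_key="prediction")],
    metrics_specs=tfma.metrics.specs_from_metrics(
        [tfma.metrics.FairnessIndicators(thresholds=[0.25, 0.5, 0.75])]
    ),
    # One overall slice plus one slice per value of 'sex'.
    slicing_specs=[tfma.SlicingSpec(), tfma.SlicingSpec(feature_keys=["sex"])],
)

# Evaluates the already-computed predictions and returns per-slice metrics
# (false positive rate, false negative rate, etc. at each threshold).
result = tfma.analyze_raw_data(data=df, eval_config=eval_config)
```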
Figure 2: Fairness Indicators' visualisation of the result of a model's evaluation [9].
Discussion
Several common problems arose across all of the bias evaluation frameworks. One is that the fairness metrics they provide are limited to definitions based on the model's predictions and the ground truth. In other words, they perceive fairness as equality rather than equity. Consequently, they may not contribute anything useful to the fairness problem in scenarios where equitable predictions are preferred over equal predictions. Another common problem is that these frameworks do not provide enough guiding material to new users regarding which fairness metrics they should employ. They all assume that users have a complete understanding of each metric's usage and limitations and can choose the metric best suited to their situation. Unfortunately, this assumption does not hold for users who are new to model fairness evaluation. The available metrics are extensive yet similar, leaving new users in need of clarification and guidance when selecting between them.
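To see what these definitions look like, consider two of the most common metrics offered by the frameworks above, statistical parity and equal opportunity; both compare groups using nothing beyond the model's predictions and the ground-truth labels (the toy arrays below are purely illustrative):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])                  # ground-truth labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])                  # model predictions
group  = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])  # protected attribute

def selection_rate(pred):
    # Fraction of individuals given a positive prediction.
    return pred.mean()

def true_positive_rate(true, pred):
    # Fraction of truly positive individuals predicted positive.
    return pred[true == 1].mean()

a, b = group == "a", group == "b"

# Statistical parity difference: equal chance of a positive prediction.
spd = selection_rate(y_pred[a]) - selection_rate(y_pred[b])

# Equal opportunity difference: equal true positive rates across groups.
eod = true_positive_rate(y_true[a], y_pred[a]) - true_positive_rate(y_true[b], y_pred[b])

print(spd, eod)
```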
The future development of bias evaluation frameworks should focus on creating and utilising new fairness metrics with other definitions. According to a similar study, we can explore fairness metrics with definitions based on predicted probability and ground truth, on the similarity between predictions, and on causal reasoning [14]. In addition, bias evaluation frameworks should introduce new guiding material for choosing bias metrics. This material does not need to be extensive; even an example scenario or a short sentence suggesting a metric's application would suffice. Another area that future development should focus on is bias evaluation frameworks for language models. Currently, frameworks for language models are less capable than their counterparts for machine learning models. However, researchers are developing a diverse range of bias evaluation and mitigation methods [11], so it should be possible to create a bias evaluation framework for language models with evaluation capabilities equivalent to those of Aequitas.
In conclusion, this survey provides a brief overview of the current state of standard and subordinate bias evaluation frameworks. It also finds that, through their rosters of fairness metrics, most of them treat fairness only as equality rather than equity.
Acknowledgements
I would like to thank my supervisors, Professor Gill Dobbie and Dr. Vithya Yogarajan, for guiding me in choosing the research subjects and providing helpful information when I was searching for bias evaluation frameworks.
[1] A. Mukerjee, R. Biswas, K. Deb, and A. P. Mathur, “Multi-objective Evolutionary Algorithms for the Risk-return Trade-off in Bank Loan Management,” International Transactions in Operational Research, vol. 9, no. 5, pp. 583–597, Mar. 2002, doi: 10.1111/1475-3995.00375.
[2] E. Faliagka, K. Ramantas, A. Tsakalidis, and G. Tzimas, “Application of Machine Learning Algorithms to an Online Recruitment System,” in International Conference on Internet and Web Applications and Services, 2012, pp. 215–220.
[3] J. S. Moore, “An expert system approach to graduate school admission decisions and academic performance prediction,” Omega, vol. 26, no. 5, pp. 659–670, Oct. 1998, doi:10.1016/S0305-0483(98)00008-5.
[4] I. Yeh and C. Lien, “The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients,” Expert Systems with Applications, vol. 36, no. 2, pp. 2473–2480, Mar. 2009, doi: 10.1016/j.eswa.2007.12.020.
[5] D. Pessach and E. Shmueli, “A review on fairness in machine learning,” ACM Computing Surveys, vol. 55, no. 3, pp. 1–44, Feb. 2022. [Online]. Available: https://dl.acm.org/doi/abs/10.1145/3494672
[6] J. Angwin, J. Larson, S. Mattu and L. Kirchner, “Machine Bias: There’s software used across the country to predict future criminals. And it’s biased against blacks.” ProPublica, https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing (accessed Feb. 21, 2024).
[7] A. Lambrecht and C. E. Tucker, “Algorithmic Bias? An Empirical Study into Apparent Gender-Based Discrimination in the Display of STEM Career Ads,” Mar. 2018. Available: https://ssrn.com/abstract=2852260
[8] R. Bellamy et al., “AI Fairness 360: An Extensible Toolkit for Detecting and Mitigating Algorithmic Bias,” IBM Journal of Research and Development, vol. PP, pp. 1–1, Sep. 2019, doi: 10.1147/JRD.2019.2942287.
[9] Tensorflow. “Fairness Indicators.” GitHub, https://github.com/tensorflow/fairness-indicators (accessed Feb. 21, 2024).
[10] P. Saleiro et al., “Aequitas: A bias and fairness audit toolkit,” arXiv, Nov. 2018. doi:10.48550/arXiv.1811.05577. [Online]. Available: https://arxiv.org/abs/1811.05577
[11] H. Viswanath and T. Zhang, “FairPy: A Toolkit for Evaluation of Social Biases and their Mitigation in Large Language Models,” arXiv. doi: 10.48550/arXiv.2302.05508. [Online]. Available: https://arxiv.org/abs/2302.05508
[12] HuggingFace. “Evaluating Language Model Bias with Evaluate.” GitHub, https://github.com/huggingface/blog/blob/main/evaluating-llm-bias.md (accessed Feb. 21, 2024).
[13] P. Liang et al., “Holistic Evaluation of Language Models,” arXiv. doi:10.48550/arXiv.2211.09110. [Online]. Available: https://arxiv.org/abs/2211.09110
[14] B. Richardson, J. Garcia-Gathright, S. F. Way, J. Thom and H. Cramer, “Towards fairness in practice: A practitioner-oriented rubric for evaluating Fair ML Toolkits,” in Conference on Human Factors in Computing Systems, May 2021, pp. 1-13. [Online]. Available: https://dl.acm.org/doi/abs/10.1145/3411764.3445604
Kejun is a fourth-year student interested in machine learning and its fairness dilemma. He is currently working on his Honours project, using meta-learning to help address the dilemma.